The present invention relates to a parallel processing processor and a parallel processing method. More particularly, the invention relates to a parallel processing processor and a parallel processing method for use therewith, the processor comprising a facility for processing in a dedicated structure α data representing the transparency of images, whereby image processing is performed at high speed.
Standards for digital moving picture coding specify ways to divide a picture into blocks of a specific size each, to predict motions of each block and to code predictive errors per block. Such block-by-block processing is carried out effectively by software when the latter is run on a processor capable of performing operations on numerous pixels in response to a single instruction. Among the many classifications of processors, those published by Flynn in 1966 are well known and widely accepted ("Very high-speed computing systems," Proc. IEEE, 12, 1091-9; Flynn, M. J., 1966). Processors, as defined by Flynn in the cited publication, fall into four categories: SISD (single instruction stream-single data stream) type, MIMD (multiple instruction stream-multiple data stream) type, SIMD (single instruction stream-multiple data stream) type, and MISD (multiple instruction stream-single data stream) type. The processor suitable for the above-mentioned block-by-block processing belongs to the SIMD type. According to Flynn, the SIMD type processor is characterized in that "multiple operands are executed by the same instruction stream (ibid.)."
Discussed below is how picture coding algorithms are processed illustratively by software using the SIMD type processor.
A typical algorithm to which the SIMD type processor is applied advantageously is motion compensation, a technique for removing correlations between frames that are temporally adjacent to one another. MPEG1 and MPEG2, both international standards for moving picture coding, embrace a technique called block matching for motion compensation.
Block matching, what it is and how it works, is outlined below with reference to FIG. 2. FIG. 2 is a conceptual view explaining how block matching is typically performed. A current frame 21 is a frame about to be coded. A reference frame 22 is a frame which is temporally close to an image of the current frame and which represents a decoded image of a previously coded frame. To effect block matching requires utilizing a luminance signal alone, or employing both a luminance signal and a chrominance signal. Where software is used for block matching, the luminance signal alone is generally adopted in view of relatively low volumes of operations involved. The description that follows thus assumes the use of the luminance signal alone.
The current frame 21 is first divided into blocks of the same size as indicated by broken lines (each block generally measures 16×16 pixels or 16×8 pixels). A temporal motion between the current frame 21 and the reference frame 22 is then detected for each block. A block 23 is taken up here as an example for describing how a temporal motion is specifically detected. A block 24 is shown spatially in the same position in the reference frame as the block 23 in the current frame. A region represented by the block 24 is moved, its size unchanged, to the position of a block 25 at an integer or half-pixel resolution. Every time such a motion takes place, the absolute values of the differences between the moved region and the block 23 are summed over all their pixels. The process is carried out on all motion patterns that may be defined in a predetermined search range (e.g., from −15 to +15 pixels horizontally and from −15 to +15 pixels vertically relative to the block 24). The motion from the block 24 to the block position yielding the smallest sum of absolute differences is detected as a motion vector. For example, if the block 25 turns out to be the block yielding the smallest sum of absolute differences, then a vector 26 is detected as the motion vector.
While indispensable for coding, block matching is a technique that requires a huge number of pixel-by-pixel operations (subtractive operations, absolute value operations, additive operations). Illustratively, if the picture size is 176×144 pixels and the block size is 16×16 pixels, the number of divided blocks is 99. In such a case, there are 289 search block patterns for each block provided the search range for block matching is set for ±8 pixels at an integer pixel resolution. It follows that each of the above-mentioned three types of operation needs to be carried out 289×99×256 times (256 being the number of intra-block pixels). If the picture size is that of the standard television (SDTV), or if the motion search range needs to be enlarged illustratively to accommodate sports-related images, or if the pixel resolution needs to be maintained at a high level during the search, the volume of necessary operations will have to be increased tens- to hundreds-fold. For these reasons, it used to be general practice to employ dedicated hardware for executing block matching. Today, however, advances in processor technology and the emergence of simplified block matching techniques have made it possible for a general purpose processor to carry out the block matching process. As mentioned earlier, SIMD type processors are used advantageously to perform block-by-block processing such as block matching.
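The operation count given above can be checked with a short calculation. The following C sketch is illustrative only; the function name full_search_ops and its parameter names are assumptions introduced here and do not appear in the original description.

```c
#include <stdint.h>

/* Number of times each per-pixel operation (subtraction, absolute value,
 * addition) is executed in an exhaustive search at integer pixel resolution.
 * The function and parameter names are illustrative assumptions. */
uint64_t full_search_ops(unsigned width, unsigned height,
                         unsigned block, unsigned range)
{
    /* Blocks per frame, e.g. (176 / 16) * (144 / 16) = 99. */
    uint64_t blocks = (uint64_t)(width / block) * (height / block);
    /* Candidate positions per block, e.g. (2 * 8 + 1)^2 = 289. */
    uint64_t candidates = (uint64_t)(2 * range + 1) * (2 * range + 1);
    /* Pixels per block, e.g. 16 * 16 = 256. */
    uint64_t pixels = (uint64_t)block * block;
    return candidates * blocks * pixels;
}
```

For the figures cited above, full_search_ops(176, 144, 16, 8) evaluates to 289×99×256 = 7,324,416 executions of each operation type per frame. If an SDTV frame is assumed to measure 720×480 pixels and the range is widened to ±15 pixels, the same function gives 332,121,600, roughly 45 times as many.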
A conventional SIMD type parallel processing processor will now be described with reference to FIG. 3. FIG. 3 is a block diagram of a conventional parallel processing processor. The processor works as follows: instructions to be executed are first sent from an external memory 130 to an instruction fetch circuit 110 over a processor-to-main memory bus 180. The instruction fetch circuit 110 includes an instruction memory for instruction storage, a program counter, and an adder for controlling the address in a register in the program counter. The instruction fetch circuit 110 supplies an instruction decoder 120 with the received instructions in the order in which they are to be executed. Every time an instruction is received, the instruction decoder 120 decodes it to find such information as the type of operation, a read address and a write address. The information is transferred to a control circuit 140 and a general purpose register 150. Each instruction is then processed by the general purpose register 150, a SIMD type ALU 160 and a data memory 170 according to control information (141, 142, 143) from the control circuit 140. For purpose of simplification and illustration, it is assumed that the parallel processing processor shown in FIG. 3 has four SIMD type ALUs for concurrent processing of four pixels.
Described below is typical processing of block matching by use of the C language and an assembler code.
C code 1 below is an example in which a block matching algorithm for a block is described in C language. It is assumed that the block size is 16×16 pixels and that a vector (vec_x, vec_y) is detected as representative of a motion vector when a value "error" becomes the smallest.
C code 1: an example of block matching
where "for" denotes the statements that describe loops in C language. The two outer "for" statements specify loops for a search range of 16×16 pixels vertically and horizontally; the two inner "for" statements designate loops in which to obtain differences of image data within a block; "current" stands for image data about the current frame with respect to an argument; and "reference" denotes image data on the reference frame.
An assembler code 1 shown below represents in an assembler code format the expression (abs(current(x+j, y+i) - reference(x+j+vec_x, y+i+vec_y))).
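A minimal sketch of block matching consistent with the description above may look as follows. It is an illustration rather than the listing of C code 1 itself: the Frame structure, the pixel() helper, the range parameter and the function name block_match are assumptions introduced here, and the caller is assumed to keep every displaced block inside the reference frame.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical frame layout: 8-bit luminance, row-major, `stride` bytes per row. */
typedef struct {
    const unsigned char *luma;
    int stride;
} Frame;

static int pixel(const Frame *f, int x, int y)
{
    return f->luma[y * f->stride + x];
}

/* Exhaustive block matching for the 16x16 block whose top-left corner is (x, y).
 * Displacements from -range to +range are tried in both directions.
 * Returns the smallest error and stores the matching vector in *vec_x_out, *vec_y_out.
 * The caller must guarantee that every displaced block lies inside `reference`. */
int block_match(const Frame *current, const Frame *reference,
                int x, int y, int range, int *vec_x_out, int *vec_y_out)
{
    int best = INT_MAX;
    for (int vec_y = -range; vec_y <= range; vec_y++) {        /* outer loops: search range */
        for (int vec_x = -range; vec_x <= range; vec_x++) {
            int error = 0;
            for (int i = 0; i < 16; i++) {                     /* inner loops: pixels in block */
                for (int j = 0; j < 16; j++) {
                    error += abs(pixel(current, x + j, y + i) -
                                 pixel(reference, x + j + vec_x, y + i + vec_y));
                }
            }
            if (error < best) {                                /* keep the smallest error */
                best = error;
                *vec_x_out = vec_x;
                *vec_y_out = vec_y;
            }
        }
    }
    return best;
}
```

The innermost statement performs the three per-pixel operations named earlier (subtraction, absolute value, addition), which is why the inner loops are the natural target for SIMD execution.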
Assembler code 1: representative of additive expression to obtain motion vector
where LOAD stands for a data transfer instruction for transferring data from the external memory 130 to the general purpose register 150, SUB for a subtractive arithmetic instruction (R0=R1-R2), and ABS for an absolute value arithmetic instruction (R5=|R0|).
How data operations take place with the assembler code 1 above in use will now be described with reference to FIGS. 3 and 4. FIG. 4 is a schematic view outlining how data operations are carried out conventionally by ALUs. In FIG. 4, a left-pointing arrow indicates reading of data from a register to the ALUs, and a right-pointing arrow denotes writing of data from the ALUs to a register.
Two LOAD instructions are first used to write data on the current and reference frames, in that order, from the external memory 130 to the data memory 170 in FIG. 3. The data written to the data memory 170 are loaded into registers R1 and R2 in accordance with write register information from the instruction decoder 120 (R3 and R4 are base registers designating positions of pixels in the frames).
A subtractive arithmetic instruction is then used to read the data from the registers R1 and R2 to the SIMD type ALU 160 in keeping with read register information from the instruction decoder 120. At the same time, the SIMD type ALU 160 acquires from the control circuit 140 ALU control information 142 that determines the type of operation. In this case, the type of operation is found to be subtractive. In the SIMD type ALU 160, a data demultiplexing circuit 161 demultiplexes the acquired information into four items of pixel data (g1 through g4) and (p1 through p4) as indicated by reference numerals 401 and 402 in FIG. 4. The data demultiplexing circuit 161 is wired in such a manner that the contents of the designated general purpose register are divided for input into four ALUs. After the demultiplexed data are assigned to the four ALUs 162a through 162d in FIG. 3, the pixel data items are each subjected to a subtractive operation by arithmetic elements 403a through 403d in FIG. 4. Following the operation, a data multiplexing circuit 163 in FIG. 3 multiplexes the resulting data. The result of the operation is placed into a register R0 in accordance with the write register information from the instruction decoder 120. The data multiplexing circuit 163 is wired in such a manner that the outputs of the four ALUs are combined for input into a single general purpose register.
An approximately similar process takes place with the absolute value operation. Data in the general purpose register R0 are first read into the SIMD type ALU 160 in keeping with the read register information from the instruction decoder 120. Simultaneously, the SIMD type ALU 160 acquires from the control circuit 140 ALU control information designating an absolute value operation. In the SIMD type ALU 160, the data demultiplexing circuit 161 demultiplexes the acquired information into four items of pixel data g1-p1 through g4-p4 as indicated by reference numeral 404 in FIG. 4. After the demultiplexed data are assigned to the four ALUs 162a through 162d in FIG. 3, the pixel data items are each subjected to an absolute value operation that provides absolute value data as indicated by reference numeral 405 in FIG. 4. Following the operation, the data multiplexing circuit 163 in FIG. 3 multiplexes the resulting data. The multiplexed data are placed into a register R5 in accordance with the write register information from the instruction decoder 120. Timing control for the processing above is provided by the control circuit 140.
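The data flow of the SUB and ABS instructions described above can be modeled in C as follows. This is a software sketch under stated assumptions, not a description of the circuit: the SimdReg type, the 16-bit field width and the function names simd_sub and simd_abs are hypothetical, and the loop over fields stands in for the four ALUs that operate concurrently in hardware.

```c
#include <stdlib.h>

#define LANES 4  /* FIG. 3 is assumed to have four ALUs */

/* Hypothetical model of one general purpose register holding four pixel fields.
 * The 16-bit field width is an assumption chosen so that differences of
 * 8-bit pixels fit with their sign. */
typedef struct {
    short lane[LANES];
} SimdReg;

/* SUB: each ALU subtracts its own pair of fields (R0 = R1 - R2). */
SimdReg simd_sub(SimdReg r1, SimdReg r2)
{
    SimdReg r0;
    for (int k = 0; k < LANES; k++)          /* demultiplex, operate, multiplex */
        r0.lane[k] = (short)(r1.lane[k] - r2.lane[k]);
    return r0;
}

/* ABS: each ALU takes the absolute value of its own field (R5 = |R0|). */
SimdReg simd_abs(SimdReg r0)
{
    SimdReg r5;
    for (int k = 0; k < LANES; k++)
        r5.lane[k] = (short)abs(r0.lane[k]);
    return r5;
}
```

Calling simd_abs(simd_sub(g, p)) on registers holding g1 through g4 and p1 through p4 yields |g1-p1| through |g4-p4|; the intermediate differences pass through a register (R0) between the two instructions, as in the conventional processor.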
The SIMD type processor, fit for repetitive operations as mentioned earlier, works in a most advantageous structure when carrying out block matching wherein the same operation is repeated on the pixels in a block.
As explained, parallel processing processors provide a viable technique for boosting the throughput of such image processing as block matching. Meanwhile, in a field of computer graphics where images of objects made up of arbitrary shapes (not just rectangular) are processed, it is now common practice to furnish each pixel in color space with what is known as α data representative of pixel transparency. To perform motion prediction such as that in block matching on images containing α data requires carrying out pixel-by-pixel data masking. The requirement tends to increase the amount of processing performed by SIMD type arithmetic and logical operation units, resulting in the throughput being impeded.
What follows is an outline of the significance of α data and of a block matching algorithm taking the α data into consideration. How the amount of necessary processing is bound to increase will then be described in more detail.
FIG. 5 is a schematic view showing relations between a frame and a bounding box. The block matching process described earlier is a technique that applies to rectangular images. In recent years, however, efforts have been made to handle images of arbitrary shapes in the framework of image coding; arbitrarily shaped images used to be dealt with primarily in the field of computer graphics. Each image of an arbitrary shape comprises shape information in addition to color information (sampling planes (e.g. Y plane, U plane and V plane) are included) composed of a luminance signal and a chrominance signal. The shape information is called α data or α image. As with a color signal, an item of α data has a value ranging from 0 to 255 (for eight-bit images) representative of the transparency of a pixel. Because of their ability to indicate transparency, the α data play an important role in displaying a combination of more than two images of arbitrary shapes. That is, color signals denoting the background, persons, characters and other images of arbitrary shapes are superposed one upon another for display in a manner proportional to the values constituting the α data. The combination of the superposed images makes up a single display image. Thus the color information about the pixels positioned so that their α data are zero constitutes pixel information that has no significance in the encoding or decoding of images. This can be a disadvantage if error computations (subtractive and absolute value operations) are performed on all pixels in the block (as in block matching) when block-by-block motion prediction is carried out on arbitrarily shaped images accompanied with α data. That is, the precision of motion prediction may decrease on a boundary region of arbitrary shapes.
It is thus necessary to judge whether each pixel is a significant pixel on the basis of α data and, in the case of insignificant (i.e., transparent) pixels, to mask the addition of error values to the summation of absolute values (either no addition is performed or 0 is added).
Below is an example in which a motion vector detecting technique is applied to an image in FIG. 5. An object 51 is handled as a rectangular image 52 ready for image processing. The image 52 placed in a rectangular frame is generally called the bounding box. Image coding is carried out in units of such bounding boxes. The size of the box is generally given as a multiple of the block size for coding (usually 16×16 pixels). What follows is a description of how a motion vector is detected in a block 53 containing a region 54 having color information and a region 55 with no color information (blocks like the block 53 are each called a boundary block hereunder). It should be noted that transparent pixels having no color information possess α data that are zero values.
In order to implement block-by-block motion prediction taking α data into account, the above-cited C code 1 for block matching need only be replaced by the C code 2 or C code 3 shown below. This technique for motion prediction is called polygon matching as opposed to block matching. In this case, the α data may be one of two types: gray scale data constituting a value ranging from 0 to 255 (for eight-bit images), and binary data forming a value of either 0 or 255 (for eight-bit images). The C code 2 below is for binary data and the C code 3 for gray scale data.
The reference frame has no pixel with α data that are zero. The reason is that when a reconstructed frame is used as a reference frame, it is the encoder or decoder that compensates color information about any pixels having zero α data in the frame based on the surrounding pixels, the compensation being such that the α data will become 255 (for eight-bit images).
C code 2: an example of polygon matching with α data taken into account (in the case of binary data)
C code 3: an example of polygon matching with α data taken into account (in the case of gray scale data)
In the C code 2 above, the color data about the current frame and reference frame are subjected to an absolute value operation, and the result of the operation is ANDed with each bit of the α data. It follows that if the α data are zero, the value to be added is zero regardless of the result of the absolute value operation.
In the C code 3 above, a check is made to see if each α data item is zero. If the data item is found to be zero, the logical expression becomes true (taking a value of 1), and the logical negation of the expression is zero. If the α data item is judged to be other than zero, the logical negation of the logical expression is 1. As a result, the value to be added is zero when the α data item is zero, and becomes the result of the absolute value operation when the α data item is other than zero.
As described, whether the C code 2 or C code 3 is in use, the polygon matching process with α data taken into consideration involves data masking as frequently as once per pixel, whereby the data to be added to the error value are replaced by zeros.
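The two masking rules described for C code 2 and C code 3 can be sketched in C as follows. The sketch is an illustration written from the description above, not a reproduction of those listings; the function names, the flat 256-element block layout and the gray_scale flag are assumptions introduced here.

```c
#include <stdlib.h>

/* Binary alpha (0 or 255): the absolute difference is ANDed with the alpha bits,
 * so an alpha of 0 forces the added value to 0. */
int masked_error_binary(int cur, int ref, int alpha)
{
    return abs(cur - ref) & alpha;
}

/* Gray scale alpha (0 to 255): (alpha == 0) is 1 for a transparent pixel,
 * its logical negation is 0, and the product is therefore 0; otherwise the
 * absolute difference is kept unchanged. */
int masked_error_gray(int cur, int ref, int alpha)
{
    return abs(cur - ref) * !(alpha == 0);
}

/* Sum of masked errors over one 16x16 block stored as flat 256-element arrays
 * (a hypothetical layout chosen for brevity). */
int polygon_block_error(const unsigned char *cur, const unsigned char *ref,
                        const unsigned char *alpha, int gray_scale)
{
    int error = 0;
    for (int n = 0; n < 256; n++) {
        error += gray_scale ? masked_error_gray(cur[n], ref[n], alpha[n])
                            : masked_error_binary(cur[n], ref[n], alpha[n]);
    }
    return error;
}
```

The bitwise form is safe only for binary α data: with a gray scale value such as 128, an absolute difference of 60 would be ANDed down to 0 even though the pixel is not transparent, which is why the gray scale case tests for zero instead. Either way, one extra masking operation is incurred for every pixel in the block.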
It is therefore an object of the present invention to overcome the above and other disadvantages of the prior art and to provide a parallel processing processor for performing image processing involving α data, the processor having a dedicated function for dealing with the α data so that the burden of processing shared by parallel execution units is alleviated and that the throughput of the processor as a whole is improved.
In carrying out the invention and according to one aspect thereof, there is provided a parallel processing processor for processing images including α data indicative of pixel transparency, the parallel processing processor comprising: (a) a plurality of execution units for executing in parallel arithmetic and logical operations under control of a single instruction; (b) general purpose registers which are connected to the execution units via a data path, which input data to the execution units and which receive results of operations from the execution units; (c) α data dedicated registers which are connected to the execution units via another data path and which input data to the execution units; and (d) controlling means for directing data from the general purpose registers and the α data dedicated registers into each of the execution units under control of a single instruction.
According to another aspect of the invention, there is provided a parallel processing processor of the structure outlined above wherein, under control of a single instruction, the execution units admit data from the general purpose registers to carry out first arithmetic and logical operation on the admitted data and, without returning result of the first arithmetic and logical operation to the general purpose registers, receive data from the α data dedicated registers to perform second arithmetic and logical operation between the received data and the result of the first arithmetic and logical operation.
According to a further aspect of the invention, there is provided a parallel processing method for use with a plurality of execution units for executing in parallel arithmetic and logical operations under control of a single instruction, the parallel processing method processing images including α data indicative of pixel transparency and comprising the steps of: (a) inputting a plurality of first data from general purpose registers to the execution units for performing first arithmetic and logical operation in parallel by the units; (b) inputting a plurality of second data corresponding to the first data from α data dedicated registers to the execution units without returning result of the first arithmetic and logical operation to the general purpose registers, so as to perform second arithmetic and logical operation between the second data and the result of the first arithmetic and logical operation; and (c) outputting result of the second arithmetic and logical operation to a general purpose register designated by an instruction.
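Steps (a) through (c) can be illustrated with a software model in C. The sketch below rests on several assumptions that the summary above does not fix: four execution units, an absolute difference as the first operation, a bitwise AND with binary α data as the second operation, and the hypothetical names GeneralReg, AlphaReg and abs_diff_masked.

```c
#define UNITS 4  /* number of parallel execution units; four is an assumption */

/* Hypothetical register models: one field per execution unit. */
typedef struct { int field[UNITS]; } GeneralReg;
typedef struct { unsigned char field[UNITS]; } AlphaReg;

/* Model of one instruction covering steps (a) to (c).
 * The loop stands in for execution units that work concurrently in hardware. */
void abs_diff_masked(GeneralReg *dst, const GeneralReg *src1,
                     const GeneralReg *src2, const AlphaReg *alpha)
{
    for (int u = 0; u < UNITS; u++) {
        /* (a) first operation on data from the general purpose registers */
        int first = src1->field[u] - src2->field[u];
        if (first < 0)
            first = -first;
        /* (b) second operation with data from the alpha dedicated register;
         * `first` is a local value and is never written back to a register */
        int second = first & alpha->field[u];
        /* (c) only the final result reaches the designated general purpose register */
        dst->field[u] = second;
    }
}
```

Compared with issuing separate subtraction, absolute value and masking instructions, a fused operation of this kind avoids writing intermediate results back to the general purpose registers between steps, which is the source of the throughput gain the invention aims at.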
Other objects, features and advantages of the invention will become more apparent upon a reading of the following description and appended drawings.