1. Field of the Invention
The present invention relates to a SIMD (Single Instruction stream Multiple Data stream) microprocessor configured to process plural image data pieces in parallel using a single operation instruction, and a data transfer method for use in the SIMD microprocessor.
2. Description of the Related Art
Image data handled by digital copiers and the like are generally a collection of data pieces arranged in two dimensions. The individual data pieces constituting an image are called pixels.
Each pixel has an assigned value, which determines the content of the image. When an image is represented using only pixels with value “1” representing black and pixels with value “0” representing white, the image is represented in only two colors, namely, black and white. For representing intermediate colors, a pixel of 4-bit data may be used, for example, which can represent 16 colors corresponding to values from 0000b to 1111b (the “b” indicating binary notation). Thus, 14 intermediate colors can be represented between black and white. If a pixel of 8-bit data is used, 256 colors can be represented.
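The relationship between pixel width and the number of representable colors can be checked with a short calculation. The following is a minimal sketch; the function names gray_levels and intermediate_levels are hypothetical and chosen only for this illustration.

```python
def gray_levels(bits_per_pixel: int) -> int:
    """Return how many distinct values an unsigned pixel of this width can hold."""
    if bits_per_pixel < 1:
        raise ValueError("bits_per_pixel must be at least 1")
    return 1 << bits_per_pixel  # 2 ** bits_per_pixel


def intermediate_levels(bits_per_pixel: int) -> int:
    """Return the number of levels strictly between black and white."""
    return gray_levels(bits_per_pixel) - 2


if __name__ == "__main__":
    for bits in (1, 4, 8, 16):
        print(bits, gray_levels(bits), intermediate_levels(bits))
```

For a 4-bit pixel this gives 16 levels, of which 14 lie between black and white, matching the figures stated above.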
The size of pixel data varies depending on the intended use or the content of the image. For instance, pixels of a large number of bits are used for images requiring a fine expression such as photographs, while pixels of a small number of bits are used for images requiring small data size such as images used in communications.
SIMD microprocessors are often employed for processing image data. The SIMD processors are suitable for image processing because they can perform the same arithmetic operations on plural data pieces at the same time with a single instruction. A typical SIMD microprocessor includes plural processor elements (hereinafter referred to as “PEs”) each having an arithmetic circuit and a register. The SIMD microprocessor causes, with a single instruction, these PEs to perform the same arithmetic operations on plural data pieces at the same time. Each PE is generally designed to process a single pixel of an image when processing the image.
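The SIMD execution model described above can be illustrated in software. The following is a minimal sketch under stated assumptions: the class ProcessorElement and the function broadcast are hypothetical names, each PE holds one pixel in one register, and the loop stands in for work that the hardware performs on all PEs at the same time.

```python
from typing import Callable, List


class ProcessorElement:
    """Hypothetical model of one PE: a single register holding one pixel."""

    def __init__(self, value: int) -> None:
        self.register = value


def broadcast(pes: List[ProcessorElement], op: Callable[[int], int]) -> None:
    """Apply one instruction (op) to the register of every PE.

    The sequential loop only models the behavior; in a SIMD microprocessor
    all PEs would execute the operation simultaneously.
    """
    for pe in pes:
        pe.register = op(pe.register)


def invert_8bit(pixel: int) -> int:
    """Example instruction: invert an 8-bit gray level."""
    return 255 - pixel


if __name__ == "__main__":
    pes = [ProcessorElement(v) for v in (0, 100, 255)]
    broadcast(pes, invert_8bit)
    print([pe.register for pe in pes])
```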
In recent years, there have been demands on image processing for increasing the processing speed and improving the image quality. The image processing speed of SIMD processors may be increased by either one of two approaches. One is to increase the operating frequency of the processor, and the other is to increase the number of pixels processed at the same time.
Increasing the operating frequency has been a constant demand, and it is not easy to achieve a further significant improvement in the operating frequency. Increasing the number of pixels processed at the same time may be generally achieved by increasing the number of PEs. Increasing the number of PEs, however, results in greater circuit size and lower operating frequency.
Meanwhile, improving the image quality means increasing the number of colors or gray levels of pixels, which increases the size of pixel data. For example, the size of pixel data is increased from 8 bits for 256 gray levels to 16 bits for 65536 gray levels. If the size of pixel data is increased, the operation data size in each PE needs to be increased.
As can be seen, a variety of demands are imposed on SIMD processors, such as improving the operating frequency, increasing the number of PEs, and increasing the operation data size in each PE.
Japanese Patent Laid-Open Publication No. 2006-260479 (Patent Document 1) discloses a SIMD microprocessor that realizes an increase of the number of PEs and an increase of the operation data size. The SIMD microprocessor of Patent Document 1 is of a layered type in which each PE includes plural arithmetic circuits. This SIMD microprocessor can operate in a mode for processing reduced-size pixels using an increased number of PEs or a mode for processing increased-size pixels using a reduced number of PEs.
FIG. 8 illustrates an exemplary configuration of related-art PEs 110. Each PE 110 includes a register (REG) 111, a PE shifter (PSH) 112, a bit shifter (BSH) 113, an ALU (L) 114a, and an ALU (H) 114b.
The register 111 temporarily stores data to be operated on in the PE 110. In the example of FIG. 8, in order to process both 8-bit pixels and 16-bit pixels, each PE 110 is provided with one 16-bit register as the register 111, which can be split into two 8-bit registers.
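The splitting of one 16-bit register into two 8-bit registers can be sketched as simple bit packing. This is an illustrative sketch only: the function names are hypothetical, and the assignment of the first pixel to the lower-order byte is an assumption that the description does not specify.

```python
from typing import Tuple

MASK8 = 0xFF


def pack_two_pixels(low: int, high: int) -> int:
    """Store two 8-bit pixels in one 16-bit register value.

    Assumption: 'low' occupies bits 0-7 and 'high' occupies bits 8-15.
    """
    return ((high & MASK8) << 8) | (low & MASK8)


def unpack_two_pixels(reg: int) -> Tuple[int, int]:
    """Read the register back as two 8-bit pixels (low, high)."""
    return reg & MASK8, (reg >> 8) & MASK8


if __name__ == "__main__":
    reg = pack_two_pixels(0x34, 0x12)
    print(hex(reg), unpack_two_pixels(reg))
```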
The PE shifter 112 selects data pieces from the register 111 in the current PE 110 and the registers 111 in adjacent PEs 110 and transfers the selected data pieces to the bit shifter 113. That is, data pieces are shifted among the PEs 110. The PE shifter 112 of FIG. 8 includes 7-to-1 multiplexers 112a in order to refer to data of the three preceding and three following consecutive pixels in addition to the current pixel. In the case of 16-bit data, data pieces in a PE 110 are shifted (transferred) as they are. In the case of 8-bit data, either one of the following two transfer methods is used. One is for the case where the priority in data arrangement is given to the arrangement order of the PEs 110. This method transfers data pieces in the same manner as in the case of 16-bit data. The other is for the case where the priority is given to the arrangement order within each PE 110. This method requires data transfer within each PE 110. Therefore, 2-to-1 multiplexers 112b are provided at the stage subsequent to the 7-to-1 multiplexers 112a in the PE shifter 112.
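The two transfer methods can be modeled in software as follows. This is a sketch of one possible reading, not the circuit of FIG. 8: the function names are hypothetical, positions outside the PE array are filled with 0 as an assumption, and for the in-PE arrangement it is assumed that PE i holds pixel 2i in its lower half and pixel 2i+1 in its upper half.

```python
from typing import List, Tuple

MAX_SHIFT = 3  # a 7-to-1 multiplexer reaches three neighbors on each side


def _check(offset: int) -> None:
    if not -MAX_SHIFT <= offset <= MAX_SHIFT:
        raise ValueError("offset must be within -3..3")


def shift_across_pes(regs: List[int], offset: int, fill: int = 0) -> List[int]:
    """16-bit data, or 8-bit data arranged in PE order.

    PE i receives the whole register of PE i + offset (one selection stage).
    """
    _check(offset)
    n = len(regs)
    return [regs[i + offset] if 0 <= i + offset < n else fill for i in range(n)]


def shift_within_pes(
    regs: List[Tuple[int, int]], offset: int, fill: int = 0
) -> List[Tuple[int, int]]:
    """8-bit data arranged in order within each PE, as (low, high) pairs.

    Stage 1 (modeling the 7-to-1 selection) picks the source PE;
    stage 2 (modeling the 2-to-1 selection) picks the half inside that PE.
    """
    _check(offset)
    n = len(regs)
    out = []
    for pe in range(n):
        halves = []
        for half in range(2):
            src_pe, src_half = divmod(2 * pe + half + offset, 2)
            halves.append(regs[src_pe][src_half] if 0 <= src_pe < n else fill)
        out.append((halves[0], halves[1]))
    return out


if __name__ == "__main__":
    print(shift_across_pes([10, 20, 30, 40], 1))
    print(shift_within_pes([(0, 1), (2, 3), (4, 5)], 1))
```

In the second function a shift by one pixel moves data between the two halves of the same PE, which is why a second selection stage is needed in addition to the selection among PEs.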
The bit shifter 113 performs bit shift and bit extension of data. Because an ALU requires double-precision arithmetic capacity with respect to the values in the register 111, 16-bit data are extended to 32 bits and 8-bit data are extended to 16 bits. After the data are converted into double-precision data by using a 16-to-1 multiplexer 113a for 16-bit data and an 8-to-1 multiplexer 113b for 8-bit data, either one of the results is selected. Then, the lower-order 16 bits are transferred to the lower ALU (L) 114a, while the higher-order 16 bits are transferred to the higher ALU (H) 114b.
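Bit extension to double precision and the split into lower-order and higher-order 16 bits can be sketched as follows. The description does not state whether the extension is zero extension or sign extension, so this sketch offers both as an assumption; the function names extend and split_halves are hypothetical.

```python
from typing import Tuple


def extend(value: int, width: int, signed: bool = False) -> int:
    """Extend a width-bit value to 2 * width bits.

    signed=False performs zero extension; signed=True copies the sign bit
    into the new upper bits. Which one the hardware uses is an assumption.
    """
    mask = (1 << width) - 1
    value &= mask
    if signed and (value >> (width - 1)) & 1:
        value |= mask << width  # fill the upper half with ones
    return value


def split_halves(value32: int) -> Tuple[int, int]:
    """Split a 32-bit value into (lower-order 16 bits, higher-order 16 bits)."""
    return value32 & 0xFFFF, (value32 >> 16) & 0xFFFF


if __name__ == "__main__":
    wide = extend(0x8001, 16, signed=True)
    print(hex(wide), [hex(h) for h in split_halves(wide)])
```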
The ALU (L) 114a and the ALU (H) 114b are Arithmetic and Logic Units (ALUs) each configured to perform 16-bit arithmetic operations. The ALU (L) 114a and the ALU (H) 114b can perform arithmetic operations independently of each other, and can also be linked to operate as a 32-bit ALU 114.
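One common way to link two 16-bit units into a 32-bit unit is to pass the carry out of the lower unit into the higher unit. The description does not state how the ALU (L) 114a and the ALU (H) 114b are linked, so the following is only a sketch of that assumed technique for addition, with hypothetical function names.

```python
from typing import Tuple


def add16(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """16-bit addition returning (16-bit sum, carry out)."""
    total = (a & 0xFFFF) + (b & 0xFFFF) + (carry_in & 1)
    return total & 0xFFFF, total >> 16


def add32_linked(a: int, b: int) -> int:
    """32-bit addition built from two 16-bit additions.

    The lower unit adds the lower-order halves; its carry out feeds the
    higher unit, which adds the higher-order halves. The final carry out
    is discarded, so the result wraps at 32 bits.
    """
    low, carry = add16(a, b)
    high, _ = add16(a >> 16, b >> 16, carry)
    return (high << 16) | low


if __name__ == "__main__":
    print(hex(add32_linked(0x0000FFFF, 0x00000001)))
```

Used independently, each add16 call models one 16-bit operation; chained through the carry, the pair behaves as a single 32-bit adder.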
In the PE 110 having the above-described configuration, data read from the register 111 are transferred to the ALU (L) 114a and the ALU (H) 114b via the PE shifter 112 and the bit shifter 113.
A global processor 120 is a controller for controlling operations of the PEs 110 and is an independent processor that reads and executes programs. The global processor 120 includes various registers and a memory for storing data.
FIG. 9 illustrates another exemplary configuration, wherein each PE shifter 112 includes 11-to-1 multiplexers 112c. In this configuration, the number of selectable inputs is increased so that data of the three preceding and three following pixels can be selected both in the case where the priority in data arrangement is given to the arrangement order of the PEs 110 and in the case where the priority is given to the arrangement order within each PE 110. It is difficult to determine in general which configuration is better in terms of circuit size and operating speed: the configuration of FIG. 8, which performs shifts in two steps, or the configuration of FIG. 9, which selects among many inputs and performs shifts all at once.
As described above, methods for manipulating the pixel size (the number of bits) and the number of PEs by enabling splitting in a SIMD microprocessor have been disclosed. However, realizing such an operation requires additional selector switches, resulting in increased circuit size and reduced operating speed.