The present invention relates to a Single Instruction Multiple Data (SIMD) processor.
Various techniques have been proposed regarding a SIMD processor that is able to simultaneously execute the same processing on a plurality of pieces of data by one instruction (Published Japanese Translation of PCT International Publication for Patent Application, No. 2010-531502, Japanese Unexamined Patent Application Publication No. 07-219919, International Patent Publication No. WO 2006/049331, and Shorin Kyo “In-vehicle Video Recognition LSI including 128 4-Way VLIW-type RISC core” reported by The Institute of Electronics, Information and Communication Engineers, Committee on Integrated Circuits and Devices (ICD), May, 2003, Vol. 103, No. 89, pp. 19-24: hereinafter referred to as Non-patent literature 1).
FIG. 10 schematically shows a SIMD processor disclosed in Non-patent literature 1. A SIMD processor 10 includes a control processor 20 and a processor array 30. The processor array 30 is a one-dimensionally coupled distributed memory type processor array, and includes N (N: an integer of two or larger) processor elements. These processor elements are connected in a ring shape, and perform the same processing according to an instruction from the control processor 20. When the control processor 20 sends the instruction to the processor array 30, it is possible to designate processor elements that do not execute processing by a mask bit or a mask flag (hereinafter the term “mask flag” is used). Thus, each of the processor elements included in the processor array 30 is in one of two states: performing the same processing as the other processor elements, or performing no processing.
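The effect of the mask flag may be illustrated by the following minimal sketch. It is not taken from Non-patent literature 1: the function name `simd_execute`, the representation of each processor element's data as one list element, and the polarity of the flag (True meaning "does not execute") are assumptions made only for illustration.

```python
def simd_execute(op, operands, mask_flags):
    """Apply the same operation on every processor element at once.

    operands[i] is the value held by processor element i (assumed layout).
    mask_flags[i] is True when element i is designated NOT to execute
    (assumed polarity); a masked element keeps its value unchanged.
    """
    return [value if masked else op(value)
            for value, masked in zip(operands, mask_flags)]


values = [1, 2, 3, 4]
mask_flags = [False, True, False, True]  # elements 2 and 4 are masked
result = simd_execute(lambda v: v * 10, values, mask_flags)
```

In this sketch the unmasked elements all apply the identical operation, while the masked elements perform no processing, which corresponds to the two states described above.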
In the following description, the control processor is denoted by “CP”, and the processor array and the processor element are denoted by “PE array” and “PE”, respectively.
Each of PEs (PE1-PEN) included in the PE array 30 has the same configuration. Thus, the PE1 will be described as a representative example. As shown in FIG. 10, the PE1 includes a local memory 44, a memory controller (MEMCTL) 46, and a calculation unit 48.
The calculation unit 48 executes calculation, and is able to perform data communication with an adjacent PE. The MEMCTL 46 controls a local memory access and an external memory access.
The local memory access is a memory access generated inside the PE array 30, and specifically includes a write request and a read request output from the calculation unit 48. According to the write request from the calculation unit 48, the MEMCTL 46 writes the data supplied from the calculation unit 48 into the local memory 44. According to the read request from the calculation unit 48, the MEMCTL 46 reads data from the local memory 44 and supplies the data to the calculation unit 48.
Further, upon receiving a memory access from a device outside the PE array 30 (including the CP 20), the MEMCTL 46 writes the data that is requested to be written into the local memory 44 when the memory access indicates a write access, and reads out the requested data from the local memory 44 and outputs the data when the memory access indicates a read access.
Such a SIMD processor 10 is especially effective for processing of a data group including a plurality of pieces of data arranged in two dimensions (hereinafter referred to as “two-dimensional data”). The two-dimensional data includes, for example, image data made up of the data of the pixels in one screen, and an aggregation of data input to the respective cells of a two-dimensional table. In the following description, image data is used as an example of the two-dimensional data. However, it should be understood that all the description taking the image data as an example may be applied to other two-dimensional data. Further, unless otherwise stated, the terms “pixel” and “pixel value” are used synonymously.
Typically, the width of an image (the number of pixels in the row direction) is larger than the PE number N. Thus, as shown in FIG. 11, the SIMD processor 10 divides the image data stored in the external memory into blocks, each having a width of N pixels and M rows (M: an integer of one or larger), and stores a plurality of the blocks in the local memories 44 of the PEs of the PE array 30 to cause each of the PEs to execute processing.
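The division into blocks may be sketched as follows. This is only an illustrative assumption, not the method of Non-patent literature 1: the function name `locate_pixel` is hypothetical, image coordinates are taken as 0-based, positions in the grid of blocks are 0-based, and the column and row inside a block are 1-based.

```python
def locate_pixel(x, y, n, m):
    """Map 0-based image coordinates (x, y) to a block position.

    n is the block width (the PE number N) and m is the number of rows M
    of a block. Returns (block_col, block_row, (col, row)), where
    block_col and block_row are 0-based positions in the grid of blocks
    and (col, row) is the 1-based position inside that block.
    """
    block_col, col = divmod(x, n)
    block_row, row = divmod(y, m)
    return block_col, block_row, (col + 1, row + 1)


# With N = 4 and M = 2, the pixel at (5, 3) falls in the second column
# and second row of blocks, at column 2, row 2 inside that block.
position = locate_pixel(5, 3, 4, 2)
```

Because each block is exactly N pixels wide, the 1-based column inside the block also identifies the PE that would hold the pixel under this assumed mapping.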
Typically, the total capacity of the local memories in the PE array 30 is far smaller than the capacity of the external memory. Thus, the number of blocks that may be stored in the local memories at the same time is limited. There are two methods for storing blocks in the local memories: “vertical direction priority” and “lateral direction priority”. Description will be made with reference to FIGS. 12 and 13.
FIG. 12 shows an example of the case of the “vertical direction priority”. In FIG. 12, the numbers enclosed by small dotted rectangles indicate the block numbers. Further, small rectangles in the local memory 44 indicate pixels. In “A(B,C)” (A, B, C: numbers) in each of the small rectangles showing pixels, “A” indicates a block number, and “(B,C)” indicates the numbers of the column and the row in which the pixel is located in the block, respectively. For example, 1(1,1) indicates the pixel in the first column, the first row in the block 1. Note that the (X,Y) coordinates of this pixel in the image are (0,0). Further, the symbol W indicates the width of the image (the number of pixels in the X direction, i.e., the number of columns), and the symbol H indicates the height of the image (the number of pixels in the Y direction, i.e., the number of rows). The same explanation applies to each of the following drawings.
The storage method of the “vertical direction priority” shown in FIG. 12 is a method of simultaneously storing as many pixels in the same column as possible in the local memories of the PE array 30. In this case, blocks located on the left side are preferentially stored, and among the blocks in the same column, blocks located on the upper side are preferentially stored.
In the example shown in FIG. 12, the image height H is five times as large as the number of rows M of the block. Thus, the number of rows of blocks is five. As shown in FIG. 12, the blocks 1-5 at the leftmost end (first column) of the image data in the external memory are first stored in the order of the blocks 1, 2, 3, 4, and 5, and then the blocks 6-10, which are in the second column from the left, are stored in the order of the blocks 6, 7, . . . .
Note that, regarding the data in the respective blocks, the N pixels in each row are stored in the same address (hereinafter referred to as a “local address”) of the local memories 44 of the N PEs, in the order of rows. For example, regarding the block 1, the pixels (1(1,1), 1(2,1), 1(3,1), . . . , 1(N,1)) in the first row are first stored in the same local address of the local memories 44 of the PE1 to the PEN, respectively. Each pixel in the second row is stored in the local memory 44 of the same PE as the pixel in the same column of the first row, in the local address next to that of the first-row pixel. For example, the pixel 1(1,2) (not shown) in the first column, the second row of the block 1 is stored in the local memory 44 of the PE1, in the local address next to that of the pixel 1(1,1) in the first column, the first row.
For example, when the base address BASEADDRESS (the address in which the pixel 1(1,1) is stored) in the local memory 44 is denoted by 0, the local address of each pixel in the first row of the block 1 is “0”, and the local address of each pixel in the M-th row is “M−1”. Further, the local address of each pixel in the first row of the block 2 is “M”, and the local address of each pixel in the M-th row of the block 2 is “2×M−1”. Similarly, the local address of each pixel in the first row of the block 6 is “5×M”, and the local address of each pixel in the M-th row of the block 6 is “6×M−1”.
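The local address values given above may be reproduced by the following sketch. The function names `block_number_vertical` and `local_address` are hypothetical, and the closed-form expressions are inferred from the examples in the text rather than quoted from Non-patent literature 1. Block numbers and rows inside a block are 1-based, as in FIG. 12, while positions in the grid of blocks are assumed to be 0-based.

```python
def block_number_vertical(block_col, block_row, rows_of_blocks):
    """1-based block number under the vertical direction priority.

    block_col and block_row are 0-based positions in the grid of blocks;
    rows_of_blocks is H / M (five in the example of FIG. 12).
    """
    return block_col * rows_of_blocks + block_row + 1


def local_address(block_number, row, m, base_address=0):
    """Local address shared by all N pixels of one row of one block.

    block_number and row are 1-based; m is the number of rows M per block.
    """
    return base_address + (block_number - 1) * m + (row - 1)


M = 3  # illustrative value only
first_row_block_6 = local_address(6, 1, M)  # 5 x M
last_row_block_6 = local_address(6, M, M)   # 6 x M - 1
```

With five rows of blocks, the block at the top of the second column from the left is numbered `block_number_vertical(1, 0, 5)`, that is, the block 6.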
FIG. 13 shows an example of the case of the “lateral direction priority”. The storage method of the “lateral direction priority” is a method of simultaneously storing as many pixels in the same row as possible in the local memories of the PE array 30. According to this method, blocks located on the upper side are preferentially stored, and among the blocks in the same row, blocks located on the left side are preferentially stored.
In the example shown in FIG. 13, the image width W is four times as large as the PE number N. Thus, the number of columns of blocks is four. As shown in FIG. 13, the blocks 1-4 in the uppermost row (first row) of the image data in the external memory are first stored in the order of the blocks 1, 2, 3, and 4, and then the blocks 5-8, which are in the second row from the top, are stored in the order of the blocks 5, 6, . . . .
Regarding the data in each block, as in the case of the vertical direction priority shown in FIG. 12, the N pixels in each row are stored in the same local address of the local memories 44 of the N PEs, in the order of rows.
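For comparison, the block numbering under the lateral direction priority may be sketched as follows. The function names `block_number_lateral` and `lateral_local_address` are hypothetical, the base address is assumed to be 0, and positions in the grid of blocks are assumed to be 0-based; the expressions are inferred from the description of FIG. 13 rather than quoted from it.

```python
def block_number_lateral(block_col, block_row, cols_of_blocks):
    """1-based block number under the lateral direction priority.

    block_col and block_row are 0-based positions in the grid of blocks;
    cols_of_blocks is W / N (four in the example of FIG. 13).
    """
    return block_row * cols_of_blocks + block_col + 1


def lateral_local_address(block_col, block_row, row, cols_of_blocks, m):
    """Local address of the 1-based row of a block, with base address 0.

    Blocks are stored one after another in block-number order, and each
    block occupies m consecutive local addresses, one per row.
    """
    number = block_number_lateral(block_col, block_row, cols_of_blocks)
    return (number - 1) * m + (row - 1)


M = 3  # illustrative value only
# The second block of the second row of blocks is the block 6 in FIG. 13.
block = block_number_lateral(1, 1, 4)
address = lateral_local_address(1, 1, M, 4, M)  # M-th row of the block 6
```

Only the order in which blocks are numbered differs from the vertical direction priority; the arrangement of rows inside a block is the same in both methods.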
The storage method of the lateral direction priority shown in FIG. 13 is used when it is preferable, in order to simplify the processing procedures, that all the pixels in the same row of an image be stored in the local memories of the PE array 30 simultaneously. In such a case, by adjusting the number of rows M of the pixels in the block in consideration of the capacity of the local memories, all the pixels in the same row of an image may be simultaneously stored in the local memories of the PE array 30.
Consider how the CP 20 designates local addresses when causing the PE array 30 to execute processing on pixel data stored in the local memories of the PE array 30 by the lateral direction priority method. The image data shown in FIG. 13 is used as an example.
For example, as shown in FIG. 14, when the PE array 30 processes each pixel (shown in thick lines in FIG. 14) of the first row of the block 1, the CP 20 broadcasts “0” to the PE array 30 as the local address of the pixels which are to be processed. Accordingly, all the PEs are able to specify the pixels which are to be accessed by one instruction.
Similarly, for example, as shown in FIG. 15, when the PE array 30 processes each pixel in the M-th row of the block 6, the CP 20 broadcasts “6×M−1” to the PE array 30 as the local address of the pixels which are to be processed. Accordingly, all the PEs are able to acquire the pixels which are to be accessed by one instruction.
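The broadcast of a single local address may be illustrated by the following small simulation. It is a sketch under stated assumptions and not the implementation of Non-patent literature 1: the function names `store_lateral` and `broadcast_read` are hypothetical, each local memory is modeled as a Python list, the base address is 0, and the image width and height are assumed to be exact multiples of N and M.

```python
def store_lateral(image, n, m):
    """Fill one local memory (a list) per PE in lateral direction priority.

    image[y][x] is the pixel at 0-based coordinates (x, y). The width is
    assumed to be a multiple of n and the height a multiple of m.
    """
    height, width = len(image), len(image[0])
    local_memories = [[] for _ in range(n)]
    for top in range(0, height, m):        # rows of blocks, top to bottom
        for left in range(0, width, n):    # blocks in one row, left to right
            for y in range(top, top + m):  # rows inside the block
                for pe in range(n):        # one pixel per PE
                    local_memories[pe].append(image[y][left + pe])
    return local_memories


def broadcast_read(local_memories, local_address):
    """Every PE reads its own local memory at the same broadcast address."""
    return [memory[local_address] for memory in local_memories]


N, M = 2, 2
# 4 x 4 test image whose pixel value encodes its position as 10 * y + x.
image = [[10 * y + x for x in range(4)] for y in range(4)]
memories = store_lateral(image, N, M)
first_row_block_1 = broadcast_read(memories, 0)          # image row 0, x = 0..1
last_row_block_3 = broadcast_read(memories, 3 * M - 1)   # image row 3, x = 0..1
```

In this small example the image is two blocks wide, so the block 3 is the leftmost block of the second row of blocks, and the single address 3×M−1 selects its M-th row in every PE at once.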