The present invention relates to memory access control, and more specifically to memory access control technology for SIMD (Single Instruction Multiple Data) processors.
The SIMD processor is capable of simultaneously executing the same processing for a plurality of data from one instruction. FIG. 12 shows a typical structure of the SIMD processor.
The SIMD processor 10 shown in FIG. 12 is comprised of a control processor 20 and a processor array 30. The processor array 30 is a one-dimensionally coupled, distributed-memory type processor array containing a plurality of (six in the example shown in the drawing) processor elements. These processor elements perform the same processing according to instructions from the control processor 20. When the control processor 20 sends instructions to the processor array 30, the mask bit or mask flag (hereafter unified to “mask flag”) specifies those processor elements not required in the processing. In other words, at any given time each of the plural processor elements contained in the processor array 30 is either performing the same processing as the others or not performing any processing.
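The effect of the mask flag can be illustrated with a minimal sketch. The function name `broadcast`, the list-based model of the processor elements, and the convention that a set flag excludes an element are assumptions made only for this illustration and are not taken from the drawing.

```python
def broadcast(pe_values, mask_flags, op):
    """Apply one instruction (op) to every processor element at once.

    Assumption for this sketch: mask_flags[i] is True when element i is
    not required in the processing, in which case it does nothing.
    """
    return [value if masked else op(value)
            for value, masked in zip(pe_values, mask_flags)]


# Six processor elements, as in the example of FIG. 12.
values = [1, 2, 3, 4, 5, 6]
mask = [False, False, True, False, True, False]  # third and fifth are masked

# The control processor issues a single instruction: "double your value".
result = broadcast(values, mask, lambda v: v * 2)
```

Every unmasked element executes the same operation, and the masked elements keep their previous value.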
In the following description, the control processor is called “CP” and the processor array and processor element are respectively called “PE array” and “PE.”
Each of the PEs (PE1 to PE6) contained in the PE array 30 has an identical structure, and so PE1 is utilized to describe a typical PE structure. In the drawing, the PE1 is comprised of a buffer 42, a local memory 44, a MEMCTL 46, and an arithmetic logic unit 48.
The arithmetic logic unit 48 exchanges data with adjacent PEs and also performs arithmetic processing. The MEMCTL 46 controls the local memory access and the external memory access.
The local memory access is the memory access generated within the PE array 30, and more specifically consists of write requests and read requests from the arithmetic logic unit 48. The MEMCTL 46 includes functions to write data from the arithmetic logic unit 48 into the local memory 44 according to a write request from the arithmetic logic unit 48, and to read out data from the local memory 44 according to a read request from the arithmetic logic unit 48 and provide the read data to the arithmetic logic unit 48.
To handle a memory access from the external section (including CP20) in the PE array 30, the MEMCTL46 includes functions to write the data for the write request onto the local memory 44 in the case of a write access, and to read out the data for the read request from the local memory 44 in the case of a read access, and output that data.
The buffer 42 exchanges data between the PE1 and external sections and temporarily stores the exchanged data. More specifically, in the case for example where the CP20 is write-accessing the local memory 44, the CP20 first stores the data for writing into the buffer 42, and then sends a write command. When the PE1 receives the write command, the MEMCTL46 writes the data stored in the buffer 42 into the local memory 44. Also, during read accessing of the local memory 44, the CP20 sends a read command including information on the data for reading. When the PE1 receives the read command, the MEMCTL46 reads out the applicable data from the local memory 44 and outputs that read-out data to the buffer 42. The CP20 then reads out the data from the buffer 42 and outputs that read-out data to an external section.
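The two-step access through the buffer can be sketched as follows. The class `PE`, its method names, and the use of a Python list as the local memory are hypothetical and serve only to show the order of operations described above.

```python
class PE:
    """Toy model of one PE: a one-entry buffer plus a local memory."""

    def __init__(self, size):
        self.buffer = None              # models the buffer 42
        self.local_memory = [0] * size  # models the local memory 44

    def write_command(self, address):
        # MEMCTL role: move the buffered data into the local memory.
        self.local_memory[address] = self.buffer

    def read_command(self, address):
        # MEMCTL role: move the requested data into the buffer.
        self.buffer = self.local_memory[address]


pe1 = PE(size=8)

# Write access: the CP first stores the data in the buffer,
# then sends the write command.
pe1.buffer = 42
pe1.write_command(3)

# Read access: the CP sends the read command, then collects the
# data from the buffer.
pe1.buffer = None
pe1.read_command(3)
read_back = pe1.buffer
```

The local memory is never touched directly from outside the PE in this sketch; every transfer passes through the buffer.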
The SIMD processor 10 is in this way especially effective in processing data groups (hereafter called “two-dimensional data”) where a plurality of data pieces are arrayed two-dimensionally. Two-dimensional data is, for example, an image comprised of the pixel data of one screen, or an aggregate of data input in the respective boxes of a two-dimensional table. Here, the operation of the SIMD processor 10 is described for the case of filter processing that takes the average of the pixel of interest and the pixel to its right, for an image in which there are six pixels per row. Unless described to the contrary, “pixel” and “pixel value” possess the same meaning in the following text.
In this case, the columns of the image and the PEs in the PE array 30 possess a one-to-one relationship. Examining the pixel row of interest shows that the six pixels contained in the applicable row are each stored by way of the buffer 42 into the six local memories 44 in the PE array 30. The local memory 44 in each PE stores pixels from the same row at the same address.
Suppose that each pixel of the A row in the image is stored at the address B of the local memory 44 in each PE. During filter processing of the A row, the CP20 in this case issues the instruction “Find the average value of A row pixels with adjacent pixels on the right.” Here, along with reading out the address B pixel from its own local memory, each PE also requests the address B pixel from the adjacent PE on the right. Each PE averages its own pixel with the data then sent from the adjacent PE on the right in response to that request, and also outputs to the adjacent PE on the left, in response to the request from that PE, the address B pixel read out from its own local memory.
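The filter operation on the A row can be sketched as a single data-parallel step. The function name `average_with_right` and the sample pixel values are hypothetical. The treatment of the rightmost PE, which has no neighbor on the right, is not stated in the description; the sketch assumes that it simply averages its pixel with itself.

```python
def average_with_right(row):
    """Average each pixel with the pixel held by the PE on its right.

    row[i] stands for the address B pixel in the local memory of PE(i+1).
    Assumption for this sketch: the rightmost PE reuses its own pixel.
    """
    n = len(row)
    # Every PE reads the address B pixel from its own local memory ...
    own = list(row)
    # ... and receives the address B pixel sent by its right-hand neighbor.
    from_right = [own[i + 1] if i + 1 < n else own[i] for i in range(n)]
    # All PEs then compute the average in the same step.
    return [(a + b) / 2 for a, b in zip(own, from_right)]


row_a = [10, 20, 30, 40, 50, 60]  # six pixels of the A row
filtered = average_with_right(row_a)
```

With the sample values above, `filtered` is `[15.0, 25.0, 35.0, 45.0, 55.0, 60.0]`; the last entry reflects only the edge assumption made for this sketch.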
Filter processing of all pixels in the row of interest is in this way performed simultaneously and with good efficiency. In this specification, the “row” direction of the image does not signify the lateral direction when the applicable image is displayed, but signifies the direction along which the pixels are assigned to the array of PEs. For example, when the pixels in one displayed row are respectively assigned to the PEs, then the “row” of the displayed (or reproduced) image matches the “row” as used in this specification. However, when the pixels in one displayed column are respectively assigned to the PEs, then the “column” of the displayed image becomes the “row” in this specification. Two-dimensional data other than images is handled in the same way.
The number of pixels in one row of the image is not limited to the number of PEs, and is normally larger than the number of PEs. In such cases, the image is subdivided into blocks and processing is performed on each block. The number of pixels along the row direction in each of these blocks may be the same as the number of PEs.
Methods have been disclosed from a variety of perspectives for the processing up to storing data from external sections in the local memory of each PE (Japanese Unexamined Patent Publication No. Hei 11 (1999)-66033 (patent document 1) and Shorin Kyo “Video Recognition Processor LSI for Intelligent Cruise Control Based on 128 4-Way VLIW RISC Processing Element” IEICE Technical Reports, Technical Committee on Integrated Circuits and Devices (ICD), May 2003, Vol. 103, No. 89, pp. 12-24 (non-patent document 1)). The non-patent document 1, for example, discloses a method devised to improve SIMD processor efficiency in this processing.
The method described in non-patent document 1 is explained here. The SIMD processor 10 shown in FIG. 12 is utilized as an example of the SIMD processor. To make the description easy to understand, the case of storing the six pixels of the A row described above from the external memory into the local memory of each PE in the PE array 30, or in other words into each local memory 44 of the respective PE1 through PE6, is used as an example.
In this method, besides each function block shown in FIG. 12, the SIMD processor 10 is also comprised of a DMA controller (DMA: Direct Memory Access). Moreover, the buffers 42 in the PE1 through PE6 together form a single shift register, and each of the buffers 42 is one stage of the applicable shift register.
The CP20 first of all sets the address in the external memory for the first pixel among the six pixels of the A row to serve as the readout address.
The DMA controller reads out the data at the readout address (the first pixel among the six pixels in the A row) from the external memory and stores the data in the buffer 42 of the PE1. The DMA controller next increases the readout address by one, reads out the data at the increased readout address, or in other words the second pixel, from the external memory, and stores that data in the buffer 42 of the PE1. At this time, the previously stored data (first pixel) in the buffer 42 of PE1 is shifted out from the buffer 42 of PE1 and stored in the buffer 42 of PE2. Repeating this type of shifting and storing results in the fifth through first pixels being respectively stored in the buffers 42 of PE2 through PE6 when the sixth pixel is stored in the buffer 42 of PE1.
At this point in time, the DMA controller generates an interrupt so that the CP20 can issue a write command to each PE. Each PE then writes the data stored in its own buffer 42 into the local memory 44 by way of the MEMCTL46.
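The shift-and-store sequence followed by the write command can be traced with a small simulation. The names `dma_load_row` and `write_command`, the dictionary model of the external memory, and the addresses and pixel labels are hypothetical. The sketch performs the shift just before each store, which yields the same buffer contents as the sequence described above.

```python
def dma_load_row(external_memory, start_address, num_pe):
    """Fill the PE buffers, modeled as one shift register, from external memory.

    buffers[0] stands for the buffer of PE1, buffers[1] for PE2, and so on.
    """
    buffers = [None] * num_pe
    address = start_address
    for _ in range(num_pe):
        # Shift every stage one PE further along the register ...
        buffers = [None] + buffers[:-1]
        # ... then store the newly read pixel in the buffer of PE1.
        buffers[0] = external_memory[address]
        address += 1  # the DMA controller increases the readout address by one
    return buffers


def write_command(buffers, local_memories, address_b):
    """Each PE writes its own buffer contents into its own local memory."""
    for buffered, memory in zip(buffers, local_memories):
        memory[address_b] = buffered


# Hypothetical external memory holding the six pixels of the A row.
external = {100 + i: f"pixel{i + 1}" for i in range(6)}
buffers = dma_load_row(external, start_address=100, num_pe=6)

# After the DMA interrupt, the CP issues the write command to all PEs.
local_memories = [dict() for _ in range(6)]
write_command(buffers, local_memories, address_b=0)
```

In this sketch the last pixel read sits in the buffer of PE1 and the first pixel read sits in the buffer of PE6 once loading completes, and the arithmetic logic units are not involved at any step of the loading.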
This technique stores the data from the external memory that must be placed in the local memory 44 of each PE into the local memory 44 of the respective PE by way of the buffer 42. The DMA controller handles the task of storing the data in each buffer, so that each PE can perform arithmetic processing while the DMA controller is storing data into the buffers.
The process of writing data from the external memory into the local memory of the PE can therefore be carried out while suppressing its effect on the arithmetic processing in the PE. The same applies to the process of reading out data from the PE local memory into the external memory.