Processors that operate in single instruction, multiple data (SIMD) style have been developed because SIMD processing systems have great potential for efficient high-speed parallel processing. When an algorithm such as pedestrian detection or white line detection is processed in a SIMD processor system, candidate areas are detected in a first step and verified in a following step. Each candidate area is a region of interest (ROI). To use the processing power of the SIMD processor system optimally, the array of processing elements (PEs) can be utilized not only for detecting the candidate areas in the first step but also for verifying the ROI areas. For this purpose, each ROI area has to be loaded into the internal memory of a PE, so that the same algorithm can be executed on the PE array with a different ROI area assigned to each PE.
However, because the processing elements operate in SIMD style, all PEs except one have to wait while the ROI area for that one PE is loaded. This reduces the possible gain of SIMD processing compared with, e.g., sequential processing of each ROI area in the central processor (CP).
An example of such a SIMD processor is shown in the prior art NPTL 1. FIG. 13 shows the architecture of the processor. The architecture consists of an array of processing elements (PEs) 104. The array is composed of PEs 101 with internal memories 102, which are grouped into groups 103 of PEs with internal memories. Data is transferred between the internal memory array and an external memory (EMEM) 108 over a bus 105. Line buffers 106 are arranged on the bus 105 in such a way that, between two line buffers, either a group of PEs or a control processor 107 is connected to the bus 105.
FIG. 14 shows the operation of the line transfer. In NPTL 1, an autonomous DMA operation is used for the data transfer between the internal memories and the external memory in SIMD mode. For a transfer from the internal memories to the external memory 108, one element row 201, which holds one element of 1 byte from each PE, is first read from the internal memories in parallel and stored in the line buffers 106 of the bus 105. Then, the content of the line buffers 106 is shifted out to the external memory 108 before the following row 202 is read from the internal memories.
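The two-phase line transfer described above can be sketched as follows. This is a minimal illustration and not the implementation of NPTL 1: the function name `line_transfer_to_emem`, the list-based memories, and the four-PE example are assumptions made only for this sketch.

```python
# Hypothetical model: each PE's internal memory is a list of 1-byte elements,
# and the external memory (EMEM) is a flat list.
def line_transfer_to_emem(internal_memories, num_rows):
    """Copy element rows from all PE memories to EMEM, one row at a time."""
    emem = []
    for row in range(num_rows):
        # Phase 1: read one element from every PE in parallel into the line buffers.
        line_buffers = [mem[row] for mem in internal_memories]
        # Phase 2: shift the line buffers out to EMEM before the next row is read.
        emem.extend(line_buffers)
    return emem


# Four PEs, each holding two elements.
pe_memories = [[10, 11], [20, 21], [30, 31], [40, 41]]
emem = line_transfer_to_emem(pe_memories, num_rows=2)
print(emem)  # [10, 20, 30, 40, 11, 21, 31, 41]
```

In this model, EMEM ends up holding the data row by row, with one element per PE in each row.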
For the data transfer between the internal memories and the external memory 108, whole element rows are always transferred.
If only a part of a row needs to be transferred, a masking operation is applied to the write operation to the EMEM 108.
In that case, the write action is suspended for the data elements of some PEs, even though data can be read and transferred for every PE.
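A masked write of one element row might look like the following sketch. The per-PE boolean mask, the function name `masked_row_write`, and the flat EMEM list are assumptions for illustration. The source only states that the write is suspended for some PEs.

```python
def masked_row_write(emem, base, row_values, write_mask):
    """Write one element row to EMEM, skipping PEs whose mask entry is False."""
    for pe_index, (value, enabled) in enumerate(zip(row_values, write_mask)):
        # Every PE's element is read and transferred, but only enabled ones are written.
        if enabled:
            emem[base + pe_index] = value
    return emem


emem = [0] * 4
masked_row_write(emem, 0, [10, 20, 30, 40], [True, False, False, True])
print(emem)  # [10, 0, 0, 40]
```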
There are two possibilities for processing multiple ROI areas in this kind of architecture.
In the first, the processing is done purely in the CP. In this case, the ROI areas are transferred and processed sequentially, one after another, while the PE array is not utilized.
This takes a large amount of time, because the DMA is used ineffectively and the processing power of the PE array remains unused.
In the second, the processing is done in the PE array. Here, the processing itself can be done in parallel by utilizing the SIMD parallelism.
However, because data that is not arranged row-wise in the EMEM cannot be loaded in parallel with the existing line transfer operation, the data transfer is executed sequentially: the data is transferred element by element to one processing element at a time while the other processing elements are masked out.
That is, all PEs except a single PE are masked, so that the assigned ROI data is written only to the internal memory of that single PE.
Because every PE has to be accessed element by element while the other PEs are not accessed, transferring the data for all PEs takes much longer.
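The cost of this element-wise masked transfer can be illustrated with a simple step count. The sketch below is an idealized model, not a cycle-accurate one. It assumes that each ROI is a contiguous run of `roi_size` elements in EMEM and that one element transfer costs one step. The function name and the addresses are hypothetical.

```python
def masked_elementwise_load(emem, roi_addresses, roi_size, num_pes):
    """Load one ROI per PE, one element at a time, with all other PEs masked."""
    internal = [[None] * roi_size for _ in range(num_pes)]
    steps = 0
    for pe in range(num_pes):
        for i in range(roi_size):
            # Only PE `pe` is unmasked in this step; every other PE idles.
            internal[pe][i] = emem[roi_addresses[pe] + i]
            steps += 1
    return internal, steps


emem = list(range(100))
internal, steps = masked_elementwise_load(
    emem, roi_addresses=[5, 40, 72], roi_size=4, num_pes=3
)
print(steps)        # 12
print(internal[1])  # [40, 41, 42, 43]
```

Under these assumptions the number of steps grows with the number of PEs multiplied by the ROI size, which is why the SIMD array gains little during loading.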
One more example is described in a patent application filed by the same applicant.
Japanese patent application No. 2011-159752, filed on Jul. 21, 2011 (patent literature 1), describes a new approach for transferring data more efficiently in a SIMD processor.
Referring to FIG. 15, assume that BK1-BK6 are ROIs that are to be transferred to their respectively assigned PEs and that BK1-BK6 differ in size from each other.
In this case, a DMA controller uses the maximum value of each ROI parameter as its transfer parameters.
In FIG. 15, BK2 has the maximum height (Lmax) and BK5 has the maximum width (Wmax).
Once the CP has set the start address of each region in the DMA controller, the DMA controller can transfer a region of size Lmax×Wmax to each PE in parallel, as shown in FIG. 16.
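The choice of transfer parameters can be sketched as follows. This is an illustration under stated assumptions, not the method of patent literature 1 as implemented: the `Roi` class, the 8×8 test image, and the sequential loop standing in for the parallel DMA transfer are all hypothetical, and the sketch assumes that every Lmax×Wmax region lies inside the image.

```python
from dataclasses import dataclass


@dataclass
class Roi:
    top: int      # start row of the region (hypothetical field names)
    left: int     # start column of the region
    height: int
    width: int


def transfer_parameters(rois):
    """Return (Lmax, Wmax): the largest height and the largest width over all ROIs."""
    l_max = max(r.height for r in rois)
    w_max = max(r.width for r in rois)
    return l_max, w_max


def load_regions(image, rois):
    """Copy one Lmax x Wmax region per ROI; in hardware these copies run in parallel."""
    l_max, w_max = transfer_parameters(rois)
    return [
        [row[r.left:r.left + w_max] for row in image[r.top:r.top + l_max]]
        for r in rois
    ]


# 8x8 image whose pixel value encodes its position: value = 8 * row + column.
image = [[8 * r + c for c in range(8)] for r in range(8)]
rois = [Roi(0, 0, 2, 3), Roi(1, 2, 4, 2), Roi(3, 1, 1, 5)]
regions = load_regions(image, rois)
print(transfer_parameters(rois))  # (4, 5)
print(regions[0][0])              # [0, 1, 2, 3, 4]
```

As in FIG. 15, the maximum height and the maximum width come from different ROIs here, so each PE receives a region at least as large as its own ROI in both dimensions.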