1. Field of the Invention
The present invention relates to a semiconductor processing device and, more specifically, to a configuration of a processing circuit performing arithmetic/logic operations on a large amount of data at high speed using semiconductor memories.
2. Description of the Background Art
Recently, with the widespread use of portable terminal equipment, digital signal processing that allows high-speed processing of large amounts of data, such as voice data and image data, has become increasingly important. For such digital signal processing, a DSP (Digital Signal Processor) is generally used as a dedicated semiconductor device. Digital signal processing of voice and images includes data processing such as filtering, which in turn frequently requires repetitive sum-of-products operations. Therefore, a DSP is generally configured to have a multiplication circuit, an adder circuit and a register for accumulation. When such a dedicated DSP is used, the sum-of-products operation can be executed in one machine cycle, enabling high-speed arithmetic/logic operations.
When the amount of data to be processed is very large, however, even a dedicated DSP is insufficient to attain dramatic improvement in performance. By way of example, when there are 10,000 sets of data to be operated on and the operation on each data set can be executed in one machine cycle, at least 10,000 cycles are necessary to finish the operation. Therefore, even though each individual process can be done at high speed in an arrangement in which the sum-of-products operation is done using a register file, the data are processed in series, so that the processing time increases in proportion to the amount of data, and such an arrangement cannot achieve high-speed processing.
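The proportional growth in processing time can be illustrated with a minimal sketch. It assumes, as stated above, that one sum-of-products step takes one machine cycle; the function name serial_sum_of_products and the cycle counter are hypothetical and serve only as an illustration, not as a model of any particular DSP.

```python
def serial_sum_of_products(samples, coefficients):
    """Accumulate samples[i] * coefficients[i] one pair at a time.

    Assumption: each multiply-accumulate step costs exactly one machine cycle,
    so the cycle count equals the number of data sets.
    """
    accumulator = 0
    cycles = 0
    for sample, coefficient in zip(samples, coefficients):
        accumulator += sample * coefficient  # one multiply-accumulate step
        cycles += 1                          # one machine cycle per data set
    return accumulator, cycles


samples = [1] * 10_000
coefficients = [2] * 10_000
total, cycles = serial_sum_of_products(samples, coefficients)
```

With 10,000 data sets, the loop runs 10,000 times, so the cycle count grows linearly with the amount of data regardless of how fast each individual step is.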
Further, when such a dedicated DSP is used, the processing performance depends heavily on the operating frequency, and therefore, if high-speed processing is given priority, power consumption increases considerably.
In view of the foregoing, the applicant of the present invention has already proposed a configuration allowing arithmetic/logic operations on a large amount of data at high speed (Reference 1 (Japanese Patent Laying-Open No. 2006-127460)).
In the configuration described in Reference 1, a memory cell mat is divided into a plurality of entries, and an arithmetic logic unit (ALU) is arranged corresponding to each entry. Between the entries and the corresponding arithmetic logic units (ALUs), data are transferred in a bit-serial manner, and operations are executed in parallel among a plurality of entries. For a binary operation, for example, the data of two terms are read and operated on, and the result of the operation is stored. Such an operation on data is executed on a bit-by-bit basis. Assuming that reading (load), operation and writing (store) of the operation result each require one machine cycle, each bit requires four machine cycles (two loads, one operation and one store), and when the data word of the operation target has the bit width N, the operation of each entry requires 4×N machine cycles. The data word of the operation target generally has a bit width of 8 to 64 bits. Therefore, when the number of entries is set to a relatively large value of 1024 and data of 8-bit width are to be processed in parallel, 1024 results of arithmetic operations can be obtained after 32 machine cycles. Thus, the necessary processing time can be reduced significantly as compared with sequential processing of 1024 sets of data.
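The 4×N cycle count can be checked with a small simulation. The following is a sketch only: it models each entry as a Python integer, assumes an addition as the binary operation, and counts one cycle for each load, operation and store as described above. The function name bit_serial_add and the carry handling are assumptions made for illustration and are not taken from Reference 1.

```python
def bit_serial_add(a_words, b_words, bit_width):
    """Add a_words[i] + b_words[i] in all entries, one bit position at a time.

    Every entry processes the same bit position in the same cycle, so the
    cycle count depends on bit_width only, not on the number of entries.
    Results wrap around modulo 2**bit_width.
    """
    entries = len(a_words)
    carry = [0] * entries
    result = [0] * entries
    cycles = 0
    for bit in range(bit_width):
        a_bits = [(a >> bit) & 1 for a in a_words]   # load first term
        cycles += 1
        b_bits = [(b >> bit) & 1 for b in b_words]   # load second term
        cycles += 1
        sums = [x ^ y ^ c for x, y, c in zip(a_bits, b_bits, carry)]
        carry = [(x & y) | (c & (x ^ y))
                 for x, y, c in zip(a_bits, b_bits, carry)]  # 1-bit operation
        cycles += 1
        result = [r | (s << bit) for r, s in zip(result, sums)]  # store
        cycles += 1
    return result, cycles


a_words = list(range(1024))          # 1024 entries
b_words = [1] * 1024
results, cycles = bit_serial_add([a % 256 for a in a_words], b_words, 8)
```

For 8-bit data the simulation reports 32 cycles whether it is given 3 entries or 1024, which corresponds to the figure given above for 1024 entries.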
Further, in the configuration disclosed in Reference 1, an inter-ALU connecting switch circuit (data transfer circuit: ECM (entry communicator)) is provided corresponding to the entries for data transfer between processors (ALUs), whereby data are transferred among the entries through dedicated buses. Therefore, as compared with a configuration in which data are transferred between entries through a system bus, arithmetic/logic operations can be executed with high-speed data transfer. Further, use of the inter-ALU connecting switch circuit enables operations on data stored in various regions of the memory cell mat, whereby the degree of freedom in operation can be increased, and a semiconductor processing device performing various operations can be realized.
In the configuration described in Reference 1, it is possible to execute the same arithmetic/logic operation in parallel in the processors of all entries of the memory mat. Specifically, the parallel processing device (MTX) described in Reference 1 is a processing device based on an SIMD (Single Instruction Stream Multiple Data Stream) architecture. Further, it uses the inter-ALU connecting switch circuit, so that communications between physically distant entries can be executed simultaneously in each entry, and processing across entries can also be executed.
In the configuration described in Reference 1, it is possible to execute a pointer register instruction for manipulating the contents of a pointer register representing an access location in the memory cell mat, a 1-bit load/store instruction, a 2-bit load/store instruction, a 1-bit inter-entry data moving instruction, a 2-bit inter-entry data moving instruction for transferring data between a data storage portion of an entry and a corresponding operational processing element (ALU), a 1-bit arithmetic/logic operation instruction, and a 2-bit arithmetic/logic operation instruction. Further, by setting to “0” the value of a mask register (V register) provided in the processing element, the operation of the corresponding entry can be masked and the operation can be set to a non-execution state.
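The masking by the V register can be sketched as follows. This is an illustrative model only: the function name simd_execute is hypothetical, and the behavior that a masked entry keeps its previous destination value is an assumption, since the text above states only that the operation of a masked entry is not executed.

```python
import operator


def simd_execute(op, a, b, dest, v_mask):
    """Apply op(a[i], b[i]) in every entry whose V bit is 1.

    Assumption: an entry whose V bit is 0 is masked and its dest value
    is left unchanged.
    """
    return [op(x, y) if v == 1 else d
            for x, y, d, v in zip(a, b, dest, v_mask)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
dest = [0, 0, 0, 0]
v_mask = [1, 0, 1, 0]          # entries 1 and 3 are masked
result = simd_execute(operator.add, a, b, dest, v_mask)
```

In this sketch only entries 0 and 2 perform the addition; the masked entries take part in the instruction cycle but produce no new result.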
The processing device of Reference 1 is SIMD-based, and all entries execute the same arithmetic/logic operation in parallel. Therefore, when the same arithmetic/logic operation is to be executed on a plurality of data sets, high-speed operation becomes possible, and filtering of image data, for example, can be executed at high speed.
Arithmetic/logic operations with a low degree of parallelism, however, must be executed one by one successively while operations other than the target operation are masked, or they must be processed by a host CPU. Such successive processing of arithmetic/logic operations with a low degree of parallelism hinders any increase in processing speed, and hence, the performance of the parallel processing device cannot be fully exhibited.
Further, in communication between entries in an SIMD-type architecture, all entries communicate in parallel with entries apart by the same distance (in accordance with the data moving instruction between entries). For each entry to communicate with an entry apart by an arbitrary distance, however, it is necessary to adjust the distance of data movement by combining the moving instruction between entries (data moving instruction) and the mask bit of the V register in the processing element. Therefore, parallel processing of data movement between entries at different distances is impossible.
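This limitation can be made concrete with a small sketch. It assumes a hypothetical move primitive simd_move in which every unmasked entry sends its value to the entry a fixed distance away, with wrap-around at the end of the entry array; the wrap-around, the direction of movement and both function names are assumptions for illustration and are not taken from Reference 1.

```python
def simd_move(src, dst, distance, v_mask):
    """One SIMD move: each unmasked entry i writes src[i] to dst[(i + distance) % n].

    All unmasked entries necessarily use the same distance.
    """
    n = len(src)
    for i in range(n):
        if v_mask[i]:
            dst[(i + distance) % n] = src[i]


def move_arbitrary(data, dest):
    """Send data[i] to entry dest[i] using uniform-distance moves only.

    One masked move is issued per distinct distance, so the number of
    moves grows with the number of different distances required.
    """
    n = len(data)
    out = list(data)
    distances = sorted({(dest[i] - i) % n for i in range(n)})
    for d in distances:
        # Unmask only the entries that need exactly this distance.
        mask = [(dest[i] - i) % n == d for i in range(n)]
        simd_move(data, out, d, mask)
    return out, len(distances)


uniform, uniform_moves = move_arbitrary([10, 20, 30, 40], [1, 2, 3, 0])
swapped, swapped_moves = move_arbitrary([10, 20, 30, 40], [1, 0, 3, 2])
```

In this sketch a uniform shift completes in a single move, while a pairwise swap of neighboring entries already needs two moves, one per distinct distance, each executed with the other entries masked.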
If arithmetic/logic operations and/or data moving processes with a low degree of parallelism could be performed efficiently, the processor would have wider applications.