1. Field of the Invention
The present invention relates to a semiconductor device and, more specifically, to a configuration of a processing circuit using a semiconductor memory performing arithmetic/logic operation on a large amount of data at high speed.
2. Description of the Background Art
Recently, along with the widespread use of portable terminal devices, digital signal processing that handles a large amount of data, including voice and image data, at high speed has become increasingly important. For such digital signal processing, a DSP (Digital Signal Processor) is generally used as a dedicated semiconductor device. Digital signal processing of audio and image data includes data processing such as filtering, which frequently requires repeated product-and-sum operations. Therefore, a DSP is generally configured to contain a multiplication circuit, an adder circuit and a register for accumulation. When such a dedicated DSP is used, the product-and-sum operation can be executed in one machine cycle, enabling high-speed arithmetic/logic operation.
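By way of illustration only, the repeated product-and-sum operation used in filtering can be sketched in software as follows. This is a minimal sketch of a direct-form FIR filter. The function name fir_filter, the sample values and the coefficients are assumptions introduced for this example and do not appear in any of the references.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter built from repeated multiply-accumulate steps.

    Hypothetical illustration: 'acc' plays the role of the accumulation
    register, and each inner-loop step is one product-and-sum operation.
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]  # multiply, then add to accumulator
        outputs.append(acc)
    return outputs


# Two-tap filter with unit coefficients: each output is the sum of the
# current sample and the previous one.
print(fir_filter([1, 2, 3, 4], [1, 1]))  # prints [1, 3, 5, 7]
```

Each pass through the inner loop corresponds to one product-and-sum step, which a dedicated DSP of the kind described above would execute in one machine cycle.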
Prior art Reference 1 (Japanese Patent Laying-Open No. 06-324862) shows an arrangement that uses a register file to perform such a sum-of-products operation. According to Reference 1, two terms of operand data stored in the register file are read, added by a processor, and written back to the register file through a write data register. In the arrangement shown in Reference 1, a write address and a read address are given to the register file simultaneously so that data writing and data reading are executed simultaneously. The processing time can therefore be made shorter than in an arrangement in which a data write cycle and a data read cycle are provided separately for an arithmetic/logic operation.
Prior art Reference 2 (Japanese Patent Laying-Open No. 05-197550) shows an arrangement aimed at high speed processing of a large amount of data. In this arrangement shown in FIG. 2, a plurality of processors are arranged in parallel, with each processor containing a memory. To achieve high speed parallel operations, each processor individually generates a memory address.
Further, prior art Reference 3 (Japanese Patent Laying-Open No. 10-074141) shows a signal processing apparatus aimed at high speed processing such as DCT (Discrete Cosine Transform) of image data. In the arrangement shown in Reference 3, image data are input in a bit-parallel and word-serial sequence, that is, by the word (pixel data) unit, and therefore, the data are converted to word-parallel and bit-serial data by a serial/parallel converter circuit and written to a memory array. Then, the data are transferred to processors (ALUs) arranged corresponding to the memory array, and parallel operations are executed. The memory array is divided into blocks corresponding to image data blocks, and in each block, pixel data forming the corresponding image block are stored word by word in each row.
In the arrangement shown in Reference 3, data are transferred word by word (one word being the data corresponding to one pixel) between the memory block and the corresponding processor. To achieve high speed filtering such as DCT, the same process is performed on the transferred word in the corresponding processor in each block. The results of arithmetic/logic operations are again written to the memory array, subjected to parallel/serial conversion so that the bit-serial and word-parallel data are converted to bit-parallel and word-serial data, and the resulting data are output successively line by line. In general processing, bit positions of data are not converted, and general arithmetic/logic operations are executed on a plurality of data in parallel by the processors.
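The conversion between bit-parallel, word-serial data and word-parallel, bit-serial data described for Reference 3 can be pictured as a transposition of a bit matrix. The following is a minimal software sketch under that assumption. The function names to_bit_serial and to_word_serial and the 3-bit word width are hypothetical and are not taken from Reference 3, which performs the conversion in a serial/parallel converter circuit.

```python
def to_bit_serial(words, width):
    """Convert word-serial data into bit planes (word-parallel, bit-serial).

    planes[b][i] holds bit b (least significant bit first) of words[i].
    """
    return [[(w >> b) & 1 for w in words] for b in range(width)]


def to_word_serial(planes):
    """Inverse conversion: reassemble the words from their bit planes."""
    count = len(planes[0]) if planes else 0
    return [
        sum(planes[b][i] << b for b in range(len(planes)))
        for i in range(count)
    ]


pixels = [5, 3]                    # 0b101 and 0b011, hypothetical pixel data
planes = to_bit_serial(pixels, 3)
print(planes)                      # prints [[1, 1], [0, 1], [1, 0]]
print(to_word_serial(planes))      # prints [5, 3]
```

In this picture, each bit plane holds the same bit position of every word, so that the same bit of all words is available at once.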
Prior art Reference 4 (Japanese Patent Laying-Open No. 2003-114797) shows a data processing apparatus aimed at executing a plurality of different arithmetic/logic operations in parallel. According to Reference 4, a plurality of logic modules each having limited functions are connected to multi-port type data memories. As to the connection between the logic modules and the multi-port data memories, the ports and memories of the multi-port memories to be connected to the logic modules are limited. Therefore, an address area available for data reading and writing by each logic module accessing the multi-port data memory is limited. The result of operation by each logic module is written to a data memory to which access is allowed, and through the multi-port data memories, data are successively transferred through the logic modules, to achieve data processing in a pipe-line manner.
When the amount of data to be processed is very large, even a dedicated DSP is insufficient to attain a dramatic improvement in performance. By way of example, when the data to be operated on include 10,000 sets and the operation on each data set can be executed in one machine cycle, at least 10,000 cycles are necessary to finish the operation. Therefore, though each process can be done at high speed in an arrangement that performs the sum-of-products operation using a register file as described in Reference 1, the data are processed in series, so that the processing time increases in proportion to the amount of data. Such an arrangement therefore cannot achieve high speed processing.
When such a dedicated DSP is used, the processing performance greatly depends on the operating frequency, and therefore, if high speed processing were given priority, power consumption would increase considerably.
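The relation between the amount of data, the number of cycles and the operating frequency can be checked with simple arithmetic. The sketch below assumes strictly serial processing at a fixed number of cycles per data set. The function names and the clock frequencies of 100 MHz and 200 MHz are hypothetical values chosen only for illustration.

```python
def serial_cycles(num_items, cycles_per_item=1):
    """Cycles needed when the data sets are processed one after another."""
    return num_items * cycles_per_item


def processing_time_us(num_items, clock_mhz, cycles_per_item=1):
    """Processing time in microseconds at the given clock frequency in MHz."""
    return serial_cycles(num_items, cycles_per_item) / clock_mhz


print(serial_cycles(10_000))            # prints 10000
print(processing_time_us(10_000, 100))  # prints 100.0
print(processing_time_us(10_000, 200))  # prints 50.0
```

Under these assumptions, the cycle count is fixed by the amount of data, so the processing time can be shortened only by raising the operating frequency.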
An arrangement using a register file and processors as described in Reference 1 is often designed for a specific application, so that the operation bit width and the configuration of the processing circuit tend to be fixed. When the arrangement is to be diverted to another application, the bit width, the configuration of the processing circuit and other features must be re-designed, and hence, the arrangement lacks flexibility for different applications of arithmetic/logic operations.
In the arrangement described in Reference 2, each processor contains a memory, and each processor accesses a different memory address area for processing. The data memory and the processor are arranged in separate areas, and in a logic module, address transfer and data access must be performed between the processor and the memory. This means that data transfer takes time and the machine cycle cannot be made shorter, and hence, high speed processing is hindered.
The arrangement described in Reference 3 is intended to increase the speed of processing such as DCT of image data. In this arrangement, pixel data of one line of an image plane are stored in one row of memory cells, and image blocks aligned along the row direction are subjected to parallel processing. Therefore, when the number of pixels per line increases to achieve very fine images, the memory array would become formidably large. Assuming that the data of one pixel consist of 8 bits and one line has 512 pixels, the number of memory cells in one row of the memory array will be 8×512=4 k bits, resulting in a very significant load on the row selecting line (word line) to which one row of memory cells is connected. Thus, it becomes impossible to select a memory cell at high speed to transfer data between the operating portion and the memory cell, which hinders high speed processing.
Though Reference 3 shows an arrangement in which the memory cell arrays are positioned on opposite sides of a group of processing circuits, the specific configuration of the memory array is not shown. Further, though the reference shows an arrangement of processors in an array, the specific arrangement of the group of processors is not shown at all.
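The row width in this example follows from a single multiplication, sketched below. The function name row_width_bits is hypothetical, and the line lengths of 1024 and 2048 pixels are assumptions added only to show how the row grows with finer images. Only the 8-bit, 512-pixel case is taken from the text.

```python
def row_width_bits(bits_per_pixel, pixels_per_line):
    """Memory cells on one row when one image line is stored per row."""
    return bits_per_pixel * pixels_per_line


# 512 pixels is the case in the text; 1024 and 2048 are hypothetical.
for pixels in (512, 1024, 2048):
    print(pixels, row_width_bits(8, pixels))
# prints:
# 512 4096
# 1024 8192
# 2048 16384
```

Because every cell on the row hangs on the same word line, the number of cells connected to that line grows in direct proportion to the number of pixels per line.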
The arrangement described in Reference 4 is provided with a plurality of multi-port data memories and a plurality of processors (ALUs) of low function that can access only limited areas of the respective multi-port memories. The processors (ALUs) and the memories, however, are arranged on different areas. Therefore, because of line capacitance and the like, high speed data transfer is difficult, and even when pipeline processing is performed, the machine cycle of the pipeline cannot be made shorter.
References 1 to 4 do not consider at all how to accommodate data as the object of arithmetic/logic operation having different word configurations.
In an arrangement in which a number of processors are arranged and data are transferred among the group of processors to achieve parallel operations, changes in processing contents can be accommodated flexibly by switching the data transfer path. For such switching of the data transfer path, a cross bar switch is used for line exchange in the field of communication or for a router in a parallel computer. Prior art Reference 5 (Japanese Patent Laying-Open No. 10-254843) discloses an exemplary configuration of the cross bar switch.
In the cross bar switch configuration according to Reference 5, switches are arranged along paths that allow connection of functional blocks, and in accordance with path designating information, the switches are selectively made conductive to set a data transfer path. When such a switch matrix is used, however, as the number of processors (functional blocks) to be connected increases, the number of connectable paths increases, the layout area of the switch circuits increases and, in addition, the arrangement of the switch control signal lines becomes complicated.
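The growth of such a switch matrix can be illustrated with a simple software model. The sketch below assumes a full crossbar in which every one of n functional blocks can be connected to every other, so that n×n switches are needed. The class name Crossbar and its methods are hypothetical and do not represent the specific circuit of Reference 5.

```python
class Crossbar:
    """Hypothetical model of a full n-by-n switch matrix."""

    def __init__(self, n):
        self.n = n
        # closed[src][dst] is True when the switch joining src to dst conducts
        self.closed = [[False] * n for _ in range(n)]

    def connect(self, src, dst):
        """Set a path from src to dst; each destination has one source."""
        for s in range(self.n):
            self.closed[s][dst] = False
        self.closed[src][dst] = True

    def route(self, inputs):
        """Deliver each input to the destinations its closed switches select."""
        outputs = [None] * self.n
        for s in range(self.n):
            for d in range(self.n):
                if self.closed[s][d]:
                    outputs[d] = inputs[s]
        return outputs

    def switch_count(self):
        """Number of switches in the full matrix."""
        return self.n * self.n


xb = Crossbar(4)
xb.connect(0, 2)
xb.connect(3, 0)
print(xb.route(["a", "b", "c", "d"]))  # prints ['d', None, 'a', None]
print(xb.switch_count())               # prints 16
print(Crossbar(8).switch_count())      # prints 64
```

In this model, doubling the number of functional blocks from 4 to 8 quadruples the number of switches from 16 to 64, and each switch also needs its own control signal.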