1. Field of the Invention
The present invention relates to a semiconductor memory device and, specifically, to a content addressable memory (CAM) storing ternary data. More specifically, the present invention relates to a CAM having an orthogonal transformation function of transposing rows and columns of a multi-bit data arrangement.
Particularly, the present invention relates to a semiconductor memory device realizing the orthogonal transformation function of transforming arrangement between external data and the processed data, in a semiconductor signal processing device having a parallel operation processing function.
2. Description of the Background Art
Recently, along with the widespread use of portable terminal equipment, digital signal processing that allows high-speed processing of large amounts of data such as voice and image data has become increasingly important. For such digital signal processing, a DSP (Digital Signal Processor) is generally used as a dedicated semiconductor device. Digital signal processing of voice and image data includes data processing such as filtering, which frequently requires arithmetic operations with repetitive sum-of-products operations. Therefore, a general DSP is configured to have a multiplication circuit, an adder circuit and a register for storing data before and after the operations. When such a dedicated DSP is used, the sum-of-products operation can be executed in one machine cycle, enabling a high-speed arithmetic/logic operation.
In the DSP, data words are processed successively. Therefore, when the amount of data to be processed is very large, even a dedicated DSP is insufficient to achieve dramatic improvement in performance. By way of example, when the data to be operated on include 10,000 sets and an operation on each data set can be executed in one machine cycle, at least 10,000 cycles are necessary to finish the operation. Thus, when data are processed serially using a dedicated DSP, the processing time increases in proportion to the amount of data, and it becomes difficult to achieve high speed processing.
An SIMD (Single Instruction Multiple Data) processor is known in which a plurality of data items are processed in parallel in accordance with a single instruction, in order to process a large amount of data at high speed. In the SIMD processor, in accordance with a common instruction, different data items are processed in parallel in a plurality of element processors. An arrangement using a content addressable memory (CAM) for searching for a data item that satisfies certain conditions and executing a process on it in an SIMD processor is disclosed in Reference 1 (Japanese Patent National Publication No. 2004-520668: WO2002/043068).
In Reference 1, a memory cell is formed of an N channel MOS transistor (insulated gate type field effect transistor), and complementary data are stored in the memory cell. Data writing/reading to and from the memory cell is executed by using a bit line. The bit line is also used as a search line, and the search line is driven dependent on match/mismatch between the search data and the stored data. Two search lines are provided: a “1-match” line that is driven when data “1” matches between the search data and the stored data, and a “0-match” line that is driven when data “0” matches. Using these two search lines, it is determined which of the data “1” and “0” matches.
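The two-line match scheme described above can be illustrated with a purely behavioral sketch. This is not the circuit of Reference 1; the function name `match_lines` and the boolean encoding of a driven line are assumptions introduced here for illustration only.

```python
def match_lines(stored_bit, search_bit):
    """Behavioral model of one cell (hypothetical, not the actual circuit).

    Returns (one_match, zero_match):
      one_match  is True when stored and search data are both 1,
      zero_match is True when stored and search data are both 0.
    """
    one_match = stored_bit == 1 and search_bit == 1
    zero_match = stored_bit == 0 and search_bit == 0
    return one_match, zero_match


# On a mismatch neither line is driven; on a match exactly one line is driven,
# which tells which of data "1" and "0" matched.
for stored in (0, 1):
    for search in (0, 1):
        print(stored, search, match_lines(stored, search))
```

The point of keeping two separate lines in this model is that a match result also carries the matched value, rather than only a match/mismatch flag.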
Reference 1 also shows an arrangement in which the bit/search line is selectively connected to a storage node of a memory cell in accordance with a write enable signal, in order to realize a write mask, by selectively writing data to each of the memory cells. According to Reference 1, a parallel memory array of CAM cells storing data in parallel and allowing a search of parallel data, and a serial bit array orthogonal to the parallel memory array are arranged, and data are transmitted to the parallel memory array in accordance with the data stored in the serial bit array, so that selective parallel writing to bits at specific positions of the parallel data word becomes possible.
Further, Reference 2 (Japanese Patent Laying-Open No. 10-074141) shows an arrangement aimed at high speed processing such as DCT (Discrete Cosine Transform) of image data. In the arrangement shown in Reference 2, image data are input in a bit-parallel and word-serial sequence, that is, by the word (pixel data) unit, and thereafter, the data are converted to word-parallel and bit-serial data by a serial/parallel converter circuit and written to a memory array. Then, the data are transferred to processors (ALUs) arranged corresponding to the memory array, and parallel operations are executed. The memory array is divided into blocks corresponding to image data blocks, and in each block, pixel data forming the corresponding image block are stored word by word in each row.
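The conversion between the bit-parallel and word-serial sequence and the word-parallel and bit-serial sequence amounts to a transposition of a bit matrix. The following is a minimal software sketch of that idea, assuming unsigned words and least-significant-bit-first ordering; the function names `to_bit_serial` and `to_word_serial` are hypothetical and do not appear in Reference 2, which performs the conversion with a serial/parallel converter circuit.

```python
def to_bit_serial(words, width):
    """Transpose word-serial, bit-parallel data into bit-serial, word-parallel rows.

    words: list of non-negative integers, one per data word (e.g. one pixel).
    Returns `width` rows; row b holds bit b (LSB first, an assumption) of every word.
    """
    return [[(w >> b) & 1 for w in words] for b in range(width)]


def to_word_serial(bit_rows):
    """Inverse transposition: rebuild each word from its column of bits."""
    n_words = len(bit_rows[0]) if bit_rows else 0
    return [
        sum(bit_rows[b][i] << b for b in range(len(bit_rows)))
        for i in range(n_words)
    ]


pixels = [0x12, 0xFF, 0x00, 0xA5]   # four 8-bit words arriving word by word
rows = to_bit_serial(pixels, 8)     # eight rows, each holding one bit of all words
assert to_word_serial(rows) == pixels
```

Each row of `rows` corresponds to one bit position across all words, which is the form in which data are handed to processors that operate bit-serially.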
In the arrangement shown in Reference 1, a CAM is used as an associative processor, which is a subclass of the SIMD processor. The CAM has a full NMOS structure in which only the N channel MOS transistor is used as a component, and the bit line is also used as the search line. Data writing to the CAM cell can be masked using the write enable signal. At the time of searching, search data is transferred to the bit line, and the match line is driven by a search gate in the CAM cell. The write enable signal simply controls connection between the bit line and the storage node in the CAM cell. When the data read from the CAM cell is to be stored at a transfer destination, it is impossible to mask the writing of the transferred data bits at the destination. Reference 1 is silent about a write mask function of masking a data write at the destination of data transfer.
The CAM cell shown in Reference 1 has a full NMOS configuration, and in addition, the bit line also serves as the search line. Therefore, when data is written or read through the bit line, the search gate in the CAM cell is also rendered conductive. The search gate of the CAM cell is coupled to a low-side power supply node of a flip-flop storing data, and dependent on the data stored in the flip-flop, the search line is driven to the low-side power supply voltage level. Therefore, at the time of data writing or reading to or from the CAM cell through the bit line, the search gate is rendered conductive (complementary data are transmitted to the bit line pair), the low-side power supply node of the flip-flop of the CAM cell is connected to the match line, and current consumption in writing or reading increases (the match line is maintained at the low-side power supply voltage level).
Further, in the CAM cell structure, a pair of match lines is used, and by the 1-match line and the 0-match line, search is performed using the complementary data as the search data. The search gate is coupled to the low-side power supply node of the memory cell, and therefore, in a searching operation, the match line is precharged to the high-side power supply voltage level. Further, the CAM cell has the full NMOS structure, and therefore, the bit/search line is precharged to the low-side power supply voltage level. Consequently, the precharge voltage level of the bit line differs between data reading/writing and a searching operation, and therefore, voltage control of the bit line is complicated.
Reference 1 shows a CAM processor, and the data stored in the CAM are processed. Reference 1, however, does not discuss the necessity of processing such as transformation of the sequence of data arrangement in the CAM in data processing.
In the arrangement shown in Reference 2, data are transferred on the word-by-word (data corresponding to one pixel) basis between the memory block and the corresponding processor. To implement filtering such as DCT, the same process is performed on the transferred word in the corresponding processor in each block. The results of arithmetic/logic operations are again written to the memory array, subjected to parallel/serial conversion so that the bit-serial and word-parallel data are converted to bit-parallel and word-serial data, and the resultant data are output successively line by line of the image screen. In common processing, bit positions of data are not converted, and common arithmetic/logic operations are executed on the transferred data in parallel by each of the processors.
In the arrangement described in Reference 2, pixel data of one line of an image screen are stored in one row of memory cells, and image blocks aligned along the row direction are subjected to parallel processing. Therefore, when the number of pixels per line increases to realize very fine images, the memory array arrangement would be of an impermissible size. Assuming that data of one pixel consists of 8 bits and one line has 512 pixels, the number of memory cells in one row of memory cells will be 8×512 = 4 Kbits (4096 bits), increasing the load on a row selecting line (word line) to which one row of memory cells is connected. Thus, it becomes difficult to select a memory cell at high speed and transfer data between the operating portion and the memory cell, hindering high speed processing.
Further, References 1 and 2 do not address how to execute parallel processing when the data of the object of processing have different word configurations.
The inventors' group has already proposed a configuration allowing high speed operation even when the data of the object of processing have different word configurations (Japanese Patent Application Nos. 2004-171658 and 2004-358719). In the signal processing device proposed by the inventors' group, a processor is arranged corresponding to each column of the memory array (in a bit line extending direction: entry). The data of the object of processing are stored in each entry, and in each processor, arithmetic/logic operation is performed in a bit-serial manner.
In this arrangement, the data to be processed are stored in the entry corresponding to each column, and the arithmetic/logic operation is executed in the bit-serial manner, and therefore, even when the data have different bit width (word configuration), only the number of operation cycles is increased and the contents of operation are unchanged. Therefore, the arithmetic/logic operation can be executed in parallel, even on the data having different word configurations.
Further, since the processors operate in parallel, as many processors as there are entries (columns) execute processing simultaneously, and therefore, the processing time can be reduced as compared with sequential (word-serial) processing of each data word. By way of example, consider a two-term (binary) operation on 8-bit data with 1024 entries. Assuming that transfer, operation and storage of the operation result each require 1 machine cycle per bit, the necessary numbers of cycles would be 8×2 for transfer of the two terms, 8 for the operation and 8 for storage, that is, a total of 32 operation cycles (and one more cycle for storing a carry). Although each entry requires a plurality of cycles, the time of operation can be reduced significantly compared with a configuration in which 1024 data are processed successively, as the operation is executed in parallel among the 1024 entries.
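The bit-serial, entry-parallel operation and the cycle count given above can be sketched in software as follows. This is only an illustrative model under stated assumptions: addition is chosen as the two-term operation, bits are ordered least significant first, and the names `pack`, `unpack`, `bit_serial_add` and `cycle_count` are hypothetical. The inner loop over entries stands in for processors that would act simultaneously in hardware.

```python
def pack(values, width):
    """Lay out values so that row k holds bit k (LSB first) of every entry."""
    return [[(v >> k) & 1 for v in values] for k in range(width)]


def unpack(bit_rows):
    """Rebuild one integer per entry from the bit-serial layout."""
    entries = len(bit_rows[0])
    return [sum(bit_rows[k][e] << k for k in range(len(bit_rows)))
            for e in range(entries)]


def bit_serial_add(a_bits, b_bits):
    """Add two operands one bit position at a time, in lockstep across entries.

    Returns (sum_bits, carry), where carry holds the final carry of each entry.
    """
    width, entries = len(a_bits), len(a_bits[0])
    carry = [0] * entries
    sum_bits = []
    for k in range(width):              # one step per bit position
        row = []
        for e in range(entries):        # conceptually simultaneous in hardware
            a, b, c = a_bits[k][e], b_bits[k][e], carry[e]
            row.append(a ^ b ^ c)                   # full-adder sum
            carry[e] = (a & b) | (c & (a ^ b))      # full-adder carry
        sum_bits.append(row)
    return sum_bits, carry


def cycle_count(width, operands=2):
    """Cycles assuming 1 cycle per bit for transfer, operation and storage."""
    return width * operands + width + width


a = [3, 200, 255, 17]
b = [5, 100, 1, 0]
s, carry = bit_serial_add(pack(a, 8), pack(b, 8))
print(unpack(s), carry)     # sums modulo 256 and the per-entry carry
print(cycle_count(8))       # 8*2 + 8 + 8
```

In this model the cycle count depends only on the bit width and not on the number of entries, which is the property the text relies on when comparing with word-serial processing.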
In the signal processing device, in order to achieve high speed processing by effectively utilizing the advantageous characteristics of parallel processing, it is necessary to transfer data efficiently to the memory area storing the data before and after the operations, and the circuit for data transfer must satisfy the conditions of small occupation area and low power consumption.
The CAM shown in Reference 1 described above is a single port memory, and hence it is incapable of such transformation of data arrangement.
Further, in parallel processing, even when the data to be processed have a low degree of parallelism (the number of data items to be processed in parallel is small), high speed processing is required without degrading processing performance.