The present invention relates to associative processors and, more particularly, to an associative processor configured to perform two or more different arithmetical operations simultaneously and methods for loading the associative processor with data to be processed and for downloading the data after processing.
An associative processor is a device for parallel processing of a large volume of data. FIG. 1 is a schematic illustration of a prior art associative processor 10. The heart of associative processor 10 is an array 12 of content addressable memory (CAM) cells 14 arranged in rows 16 and columns 18. Associative processor 10 also includes three registers for controlling CAM cells 14: a tags register 20 that includes many tag register cells 22, a mask register 24 that includes many mask register cells 26, and a pattern register 28 that includes many pattern register cells 30. Each cell 14, 22, 26 or 30 is capable of storing one bit (0 or 1). Tags register 20 is a part of a tags logic block 36 that communicates with each row 16 via a dedicated word enable line 32 and a dedicated match result line 34, with each tag register cell 22 being associated with one row 16 via word enable line 32, match result line 34 and a dedicated logic circuit 38. Each mask register cell 26 and each pattern register cell 30 is associated with one column 18. For illustrational simplicity, only three rows 16, only one word enable line 32, only one match result line 34 and only one logic circuit 38 are shown in FIG. 1. Typical arrays 12 include 8192 (2^13) rows 16. The array 12 illustrated in FIG. 1 includes 32 columns 18. More typically, array 12 includes 96 or more columns 18.
Each CAM cell 14 can perform two kinds of elementary operations, as directed by the contents of the corresponding cells 22, 26 or 30 of registers 20, 24 and 28: compare operations and write operations. For both kinds of elementary operations, columns 18 that are to be active are designated by the presence of “1” bits in the associated mask register cells 26. The contents of tag register cells 22 are broadcast to the associated rows 16 as “write enable” signals by tags logic block 36 via word enable lines 32, with rows 16 that receive a “1” bit being activated. In a single cycle of compare operations, each activated row 16 generates a “1” bit match signal on match result line 34 of that row 16. Each activated CAM cell 14 of that row 16 compares its contents with the contents of the cell 30 of pattern register 28 that is associated with the column 18 of that CAM cell 14. If the two contents are identical (both “0” bits or both “1” bits), that CAM cell 14 allows the match signal to pass. Otherwise, that CAM cell 14 blocks the match signal. As a result, if the contents of all the activated CAM cells 14 of a row 16 match the contents of corresponding cells 30 of pattern register 28, the match signal reaches tags logic block 36 and the associated logic circuit 38 writes a “1” bit to the associated tag register cell 22; otherwise, the associated logic circuit 38 writes a “0” bit to the associated tag register cell 22. In a single cycle of write operations, the contents of pattern register cells 30 associated with activated columns 18 are written to the activated CAM cells 14 of those columns 18.
In the example illustrated in FIG. 1, the fifth through eighth columns 18 from the right are activated by the presence of “1”s in the corresponding mask register cells 26. A binary “4” (0100) is stored in the corresponding pattern register cells 30. A compare operation cycle by associative processor 10 in this configuration tests activated rows 16 to see if a binary “4” is stored in their fifth through eighth CAM cells 14 from the right. A write operation cycle by associative processor 10 in this configuration writes binary “4” to the fifth through eighth CAM cells 14 from the right of activated rows 16.
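The two elementary operations can be illustrated with a minimal software sketch. It is a model for explanation only, not the circuit of FIG. 1: the function names `compare_cycle` and `write_cycle`, the use of Python lists for the registers, and the reduction of the array to 8 columns and 3 rows are assumptions made here. The sketch also assumes that a row that is not activated receives a “0” in its tag register cell, since it generates no match signal.

```python
from typing import List


def compare_cycle(array: List[List[int]], mask: List[int],
                  pattern: List[int], tags: List[int]) -> List[int]:
    """Model one compare cycle and return the new tags register contents.

    A row is a match when it is activated (tag bit 1) and every cell in a
    masked column equals the pattern bit of that column.
    Assumption: a non-activated row produces no match signal, so it gets 0.
    """
    new_tags = []
    for row, active in zip(array, tags):
        match = active == 1 and all(
            cell == p for cell, m, p in zip(row, mask, pattern) if m == 1
        )
        new_tags.append(1 if match else 0)
    return new_tags


def write_cycle(array: List[List[int]], mask: List[int],
                pattern: List[int], tags: List[int]) -> None:
    """Model one write cycle: copy pattern bits into the masked columns
    of every activated row, leaving all other cells unchanged."""
    for row, active in zip(array, tags):
        if active == 1:
            for col, m in enumerate(mask):
                if m == 1:
                    row[col] = pattern[col]


# Hypothetical 8-column array; index 0 is the leftmost column, so the
# fifth through eighth columns from the right are indices 3 down to 0.
array = [
    [0, 1, 0, 0, 1, 0, 1, 1],  # holds binary 4 in the masked columns
    [0, 1, 1, 0, 0, 0, 0, 0],  # holds binary 6 in the masked columns
    [0, 1, 0, 0, 0, 0, 0, 0],  # holds binary 4 in the masked columns
]
mask = [1, 1, 1, 1, 0, 0, 0, 0]
pattern = [0, 1, 0, 0, 0, 0, 0, 0]  # binary 4 (0100) in the masked columns

# Compare with all rows activated, then write binary 4 to the middle row only.
tags = compare_cycle(array, mask, pattern, [1, 1, 1])
write_cycle(array, mask, pattern, [0, 1, 0])
```

In this sketch the compare cycle tags the first and third rows, and the write cycle changes only the masked cells of the second row.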
In summary, in both kinds of elementary operations, tags register 20 and mask register 24 provide activation signals and pattern register 28 provides reference bits.
Then, in a compare operation cycle, array 12 provides input to compare with the reference bits and tags register 20 receives output; and in a write operation cycle, array 12 receives output that is identical to one or more reference bits.
Tags logic block 36 also can broadcast “1”s to all rows 16, to activate all rows 16 regardless of the contents of tags register 20.
An additional function of tags register 20 is to provide communication between rows 16. The results of a compare operation executed on rows 16 are stored in tags register 20, wherein every bit corresponds to a particular row 16. By shifting tags register 20, the results of this compare operation are communicated from their source rows 16 to other, target rows 16. In a single tags shift operation the compare result of every source row 16 is communicated to a corresponding target row 16, the distance between any source row 16 and the corresponding target row 16 being the distance of the shift.
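The tags shift described above can be sketched as follows. The function name `shift_tags` is hypothetical, and the sketch assumes that cells vacated by the shift are filled with 0 and that bits shifted past either end are discarded; the text above does not specify either behavior for the prior art device.

```python
from typing import List


def shift_tags(tags: List[int], distance: int) -> List[int]:
    """Move each compare result from its source row to the row that lies
    `distance` rows away (positive = toward higher row indices).

    Assumptions: vacated cells become 0 and bits shifted off the end are lost.
    """
    n = len(tags)
    shifted = [0] * n
    for src, bit in enumerate(tags):
        dst = src + distance
        if 0 <= dst < n:
            shifted[dst] = bit
    return shifted


# Compare results for five rows, communicated to the next row down and
# to the next row up.
results = [1, 0, 0, 1, 0]
down_one = shift_tags(results, 1)
up_one = shift_tags(results, -1)
```

After the shift, a subsequent compare or write cycle on a target row is conditioned on the compare result of its source row, which is how neighboring rows exchange information.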
Any arithmetical operation can be implemented as successive write and compare cycles. For example, to add an integer N to all the m-bit integers in an array, after the integers have been stored in m adjacent columns 18 of array 12, with one integer per row 16, the following operations are performed:
For each integer M that can be represented by m bits (i.e., the integers 0 through 2^m - 1):
(a) write M to the cells 30 of pattern register 28 that correspond to the m adjacent columns 18;
(b) activate all rows 16 by broadcasting “1” to all rows 16;
(c) execute a cycle of simultaneous compare operations with the activated CAM cells 14 to set to “1” the contents of tag register cells 22 associated with rows 16 that store M and to set to “0” the contents of all other tag register cells 22;
(d) write M+N to the cells 30 of pattern register 28 that correspond to the m adjacent columns 18; and
(e) execute a cycle of simultaneous write operations with the activated CAM cells 14 to write M+N to the activated rows 16.
Associative processor 10 is well-suited to the parallel processing of data, such as digital image data, that consist of relatively short integers. For example, each pixel of an image with 256 gray levels is represented by an 8-bit integer. To add a number N to 8192 such integers in a serial processor requires 8192 add cycles. To add N to 8192 such integers in associative processor 10 requires 256 compare cycles and 256 write cycles.
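Steps (a) through (e) and the resulting cycle count can be sketched in software. Two details of the sketch are assumptions made here and are not stated in the text above. First, the sum M+N is written to a separate group of m result columns rather than back to the source columns, because a row overwritten in place could be matched again by a later value of M. Second, the sum is reduced modulo 2^m so that it fits in m bits. The names `to_bits`, `from_bits` and `add_constant` are hypothetical.

```python
from typing import List, Tuple


def to_bits(value: int, width: int) -> List[int]:
    """Return `value` as a list of `width` bits, most significant first."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    """Inverse of to_bits."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def add_constant(array: List[List[int]], m: int, n: int) -> Tuple[int, int]:
    """Add n to the m-bit integer stored in the first m cells of every row.

    Each row has 2*m cells: m source cells followed by m result cells
    (the separate result field is an assumption of this sketch).
    Returns the number of compare cycles and write cycles used.
    """
    compares = writes = 0
    for M in range(2 ** m):
        # Steps (a)-(c): pattern = M, all rows activated, one compare cycle.
        src_pattern = to_bits(M, m)
        tags = [1 if row[:m] == src_pattern else 0 for row in array]
        compares += 1
        # Steps (d)-(e): pattern = M+N (mod 2^m), one write cycle to tagged rows.
        result_pattern = to_bits((M + n) % (2 ** m), m)
        for row, tag in zip(array, tags):
            if tag:
                row[m:] = result_pattern
        writes += 1
    return compares, writes


m = 3
values = [0, 3, 7, 5]
array = [to_bits(v, m) + [0] * m for v in values]
cycles = add_constant(array, m, 2)
results = [from_bits(row[m:]) for row in array]
```

The cycle count depends only on m and not on the number of rows: 2^m compare cycles and 2^m write cycles, which for m = 8 gives the 256 compare cycles and 256 write cycles cited above.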
More information about prior art associative processors may be found in U.S. Pat. No. 5,974,521, to Akerib, which is incorporated by reference for all purposes as if fully set forth herein.
Nevertheless, prior art associative processors such as associative processor 10 suffer from certain inefficiencies. First, rows 16 must be wide enough to accommodate all the operands of every arithmetical operation that is to be performed using the associative processor. Most arithmetical operations do not require the full width of array 12, so most of the time, many CAM cells 14 are idle. Second, although the arithmetical operations themselves are performed in parallel, the input to array 12 and the output from array 12 must be effected serially. For example, one way to store the input m-bit integers of the above example in the m adjacent columns 18 of array 12 is as follows:
(a) Select m adjacent columns 18 of array 12 to store the input integers. Set the contents of the corresponding mask register cells 26 to “1” and the contents of all the other mask register cells 26 to “0”.
(b) For each input integer, write the integer to the cells 30 of pattern register 28 that correspond to the selected columns 18, activate one row 16 of array 12 by setting the contents of the corresponding tag register cell 22 to “1” and the contents of all the other tag register cells to “0”, and execute a cycle of simultaneous write operations with the activated CAM cells 14.
Storing 8192 input integers in this manner requires 8192 write cycles, the same number of cycles as the 8192 fetch cycles that would be required by a serial processor.
Furthermore, if the data to be processed are stored in a dynamic random access memory (DRAM), then, in order to access the data stored in a row of the DRAM, a row precharge is required. This row precharge typically requires six to ten machine cycles. It would be highly advantageous to maximize the input at every row precharge. In the case of embedded DRAM, each row may store thousands of bits. It would be highly advantageous to be able to input many or all of these bits into an associative array processor in only a small number of machine cycles, especially in an application, such as real-time image processing, which requires very high data rates, typically upwards of 30 VGA frames per second.
The serial input/output issue has been addressed to a certain extent by Akerib in U.S. Pat. No. 6,195,738, which is incorporated by reference for all purposes as if fully set forth herein. According to U.S. Pat. No. 6,195,738, the memory, wherein the data to be processed are stored, is connected to tags register 20 by a bus with enough bandwidth to fill tags register 20 in one machine cycle. Enough data bits to fill tags register 20 are written from the memory to tags register 20 via the bus. A write operation cycle is used to write these bits to one of columns 18. This is repeated until as many columns 18 as required have received the desired input. This procedure is reversed, using compare operations instead of write operations, to write from array 12 to the memory.
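The difference between word-by-word input and column-by-column input through the tags register can be sketched by counting write cycles. This is an illustrative model, not the method of U.S. Pat. No. 6,195,738: the names `load_by_rows` and `load_by_columns` are hypothetical, and the column-wise sketch assumes that the target columns start out cleared to 0, so that one write cycle of a “1” pattern bit to the tagged rows reproduces the tags register contents in the column.

```python
from typing import List, Tuple


def to_bits(value: int, width: int) -> List[int]:
    """Return `value` as a list of `width` bits, most significant first."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def load_by_rows(values: List[int], m: int) -> Tuple[List[List[int]], int]:
    """Word-by-word input: one write cycle per integer, one row at a time."""
    array = [[0] * m for _ in values]
    cycles = 0
    for r, v in enumerate(values):
        array[r] = to_bits(v, m)  # pattern register holds v, only row r tagged
        cycles += 1
    return array, cycles


def load_by_columns(values: List[int], m: int) -> Tuple[List[List[int]], int]:
    """Column-by-column input: one write cycle per bit position.

    Assumption: the target columns are initially all 0.
    """
    array = [[0] * m for _ in values]
    cycles = 0
    for col in range(m):
        # One bus transfer fills the tags register with bit `col` of every value.
        tags = [to_bits(v, m)[col] for v in values]
        # One write cycle: mask selects `col`, pattern bit is 1, tagged rows only.
        for row, tag in zip(array, tags):
            if tag:
                row[col] = 1
        cycles += 1
    return array, cycles


values = list(range(20))
by_rows, row_cycles = load_by_rows(values, 8)
by_cols, col_cycles = load_by_columns(values, 8)
```

Both functions produce the same array contents, but the first needs one write cycle per integer while the second needs one write cycle per bit of word width, independent of the number of rows.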
Although the teachings of U.S. Pat. No. 6,195,738 enable parallel input and output, column by column, “from the side”, rather than word by word, “from the top”, this parallel input and output leaves room for improvement. For example, according to the teachings of U.S. Pat. No. 6,195,738, the bus that connects the memory to tags register 20 must have enough bandwidth to fill tags register 20 in one machine cycle. It is difficult to fabricate such a bus for a typical tags register 20 that includes 8192 tag register cells 22, as such a bus would have to have sufficient bandwidth to transfer 8192 bits at once. In addition, although such a bus would be used for only a small fraction of the overall processing time, such a bus would generate power consumption peaks when used. It would be advantageous to reduce the magnitude of the power consumption peaks while maintaining sufficient bandwidth to transfer the bits of tags register 20 to the memory in only a small number of machine cycles.
There is thus a widely recognized need for, and it would be highly advantageous to have, an associative processor that uses its CAM cells more intensively than known associative processors and that supports parallel input and output in a manner superior to that known in the art.
According to the present invention there is provided a method of processing a plurality of bits stored in a memory, including the steps of: (a) providing an associative processor including: (i) a first array of content addressable memory (CAM) cells, the first array including a plurality of columns of the CAM cells; (b) writing a first subplurality of the bits from the memory to a first one of the columns of the CAM cells, each bit of the first subplurality being written to a respective CAM cell of the first column; and (c) copying the first subplurality of bits from the first column to a second one of the columns of the CAM cells.
According to the present invention there is provided a device for processing data, including: (a) a memory for storing the data; (b) an associative processor, for processing the data, the associative processor including a plurality of rows and columns of content addressable memory (CAM) cells; and (c) a bus for exchanging the data between the memory and one of the columns of CAM cells.
An associative processor of the present invention includes several arrays of CAM cells, as well as a tags logic block that includes several tags registers. Each row of each CAM cell array is connected to the tags logic block by its own word enable line and by its own match result line, so that the tags logic block can associate any of its tags registers with one or more of the CAM cell arrays. Furthermore, the tags logic block can change that association at any time. Specifically, the logic circuit that is associated with corresponding rows of the several arrays manages the signals on the word enable lines and the match result lines of these CAM cell arrays with reference to corresponding tag register cells in any one of the tags registers. For example, the tags logic block effects logical combinations (e.g., AND or OR) of match signals and prior contents of the cells of one of the tags registers, and stores the results either in place in the same tags register or in another tags register.
It is preferable that at least one of the tags registers be located between two of the CAM cell arrays. Either the entire tags logic block is located between two of the CAM cell arrays, or one or more but not all tags registers are located between two of the CAM cell arrays. In the latter case, the components of the tags logic block necessarily are not all contiguous.
The ability to “mix and match” CAM cell arrays and tags registers enhances the efficiency with which the CAM cells of the present invention are used. To this end, the CAM cell arrays of the present invention typically have fewer columns than prior art CAM cell arrays. In fact, it is preferred that the sum of the number of columns of the CAM cell arrays of the present invention be equal to the number of columns needed by a prior art CAM cell array to perform all the contemplated arithmetical operations. For example, in an embodiment of the associative processor of the present invention that includes two CAM cell arrays, each with half as many columns as a prior art CAM cell array, two arithmetical operations that each require half the columns of the prior art CAM cell array are performed in parallel, with one of the arithmetical operations being performed with reference to one of the tags registers and another of the arithmetical operations being performed with reference to another of the tags registers. The two arithmetical operations may be either identical or different. To perform an arithmetical operation that requires the full width of a prior art CAM cell array, both CAM cell arrays of the present invention are associated with the same tags register, and the arithmetical operation is performed with reference to that tags register. Furthermore, arithmetical operations may be pipelined. To pipeline two sequential arithmetical operations, one CAM cell array is dedicated to the first operation and another CAM cell array is dedicated to the second operation.
Compare operation cycles on the first CAM cell array are paired with write operation cycles on the second CAM cell array to transfer the output of the first operation from the first CAM cell array to the second CAM cell array for the second operation, with the same tags register being associated with the first CAM cell array for the compare operation cycles and with the second CAM cell array for the write operation cycles. In each elementary operation cycle pair, a column of the first CAM cell array, activated by appropriate bits in the corresponding mask and pattern registers, is copied to a column of the second CAM cell array, also activated by appropriate bits in the corresponding mask and pattern registers. Note that the mask and pattern registers are shared by all the CAM cell arrays.
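One compare/write cycle pair that copies a column between two CAM cell arrays through a shared tags register can be sketched as follows. The function name `copy_column` is hypothetical. The sketch assumes that the destination column is cleared beforehand and that the pattern bit for both the compare and the write is “1”; column indices are local to each array for simplicity, although the mask and pattern registers described above span all the arrays.

```python
from typing import List


def copy_column(src: List[List[int]], src_col: int,
                dst: List[List[int]], dst_col: int) -> List[int]:
    """Copy one column of `src` to one column of `dst` in two cycles.

    Cycle 1 (compare on src): the mask selects src_col and the pattern bit
    is 1, so the tags register receives a copy of that column.
    Cycle 2 (write on dst): the same tags register is now associated with
    dst; the mask selects dst_col and a 1 is written to the tagged rows.
    Assumption: dst_col holds 0 in every row before the write.
    """
    tags = [1 if row[src_col] == 1 else 0 for row in src]
    for row, tag in zip(dst, tags):
        if tag:
            row[dst_col] = 1
    return tags


# First array holds the output of the first operation; second array is empty.
first_array = [[1, 0], [0, 1], [1, 1]]
second_array = [[0, 0, 0] for _ in first_array]
shared_tags = copy_column(first_array, 1, second_array, 2)
```

Repeating the pair once per column transfers an entire operand from the first array to the second, with the number of cycle pairs equal to the operand width rather than the number of rows.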
Preferably, the tags logic block can configure two of the tags registers temporarily as a single long tags register. This capability is useful, for example, in processing two contiguous portions of a digital image, each portion being stored in a different CAM cell array. In particular, during the application of an operator, such as a smoother or a convolution, that requires input from both sides of the boundary between the two portions, each of the two tags registers is associated with one of the CAM cell arrays, and compare operations are performed on the CAM cell arrays, with output to their respective tags registers. Then the contents of the tags registers are shifted, with bits that leave one tags register being shifted to the other tags register. In this way, data from one of the two contiguous portions of the digital image are processed with reference to data from the other portion, despite the two portions being stored in different CAM cell arrays. In subsequent operations, data in the two contiguous portions may be processed separately, in the usual manner. Following a compare operation on one of the CAM cell arrays, the contents of the tags register associated with that CAM cell array are shifted only within that tags register, with bits that leave one end of the tags register being either discarded or cycled to the other end of the tags register, so that the data stored in that CAM cell array are processed independently of the data stored in the other CAM cell array.
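The two shifting modes described above, joined and separate, can be sketched as follows. The function names `shift_joined` and `shift_separate` are hypothetical, and the sketch assumes a shift toward higher row indices with the first register preceding the second; cells vacated at the start of the long register are assumed to be filled with 0.

```python
from typing import List, Tuple


def shift_joined(first: List[int], second: List[int],
                 distance: int = 1) -> Tuple[List[int], List[int]]:
    """Shift two tags registers as a single long register.

    Bits that leave the end of `first` enter the start of `second`.
    Assumption: vacated cells at the start of `first` become 0.
    """
    joined = first + second
    shifted = [0] * distance + joined[:len(joined) - distance]
    return shifted[:len(first)], shifted[len(first):]


def shift_separate(tags: List[int], distance: int = 1,
                   cyclic: bool = False) -> List[int]:
    """Shift one tags register on its own.

    Bits that leave the end are either discarded (cyclic=False) or
    cycled back to the other end of the same register (cyclic=True).
    """
    keep = tags[:len(tags) - distance]
    leaving = tags[len(tags) - distance:]
    head = leaving if cyclic else [0] * distance
    return head + keep


# Compare results for two contiguous image portions held in two arrays.
portion_a = [1, 0, 1]
portion_b = [0, 0, 1]
joined_a, joined_b = shift_joined(portion_a, portion_b)
alone_discard = shift_separate(portion_a)
alone_cyclic = shift_separate(portion_a, cyclic=True)
```

In the joined mode the last compare result of the first portion reaches the first row of the second portion, which is what an operator spanning the boundary requires; in the separate mode no bit crosses between the two registers.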
The ability to “mix and match” CAM cell arrays and tags registers also facilitates another aspect of the present invention, the parallelization of input and output in a manner superior to that taught in U.S. Pat. No. 6,195,738. For example, to process data stored in a memory simultaneously in two CAM cell arrays, as described above, one of the tags registers is designated as an input tags register. This input tags register is associated with one of the CAM cell arrays. Enough data bits to fill the input tags register are written from the memory to the input tags register, over the course of several machine cycles, using a bus with less bandwidth than is needed to fill the input tags register in one machine cycle. In each machine cycle, a control block selects the tag register cells of the input tags register that are to receive the data bits that are written from the memory to the input tags register during that machine cycle. After the input tags register is filled, a write operation cycle is used to write these bits to a column of the target CAM cell array. This is repeated until as many columns of the CAM cell array as required have received the desired input. Then the input tags register is associated with a different CAM cell array. Another set of data bits is written from the memory to the input tags register, and a write operation cycle again is used to write these bits to a column of the second CAM cell array. This is repeated until as many columns of the second CAM cell array as required have received the desired input.
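Filling the input tags register over several machine cycles through a narrower bus can be sketched as follows. The function name `fill_tags_over_bus` and the bus width of 1024 bits are hypothetical values chosen for illustration, and the sketch assumes that the control block selects consecutive groups of tag register cells in successive cycles; the text above does not specify which cells are selected in which cycle.

```python
from typing import List, Tuple


def fill_tags_over_bus(memory_bits: List[int],
                       bus_width: int) -> Tuple[List[int], int]:
    """Fill a tags register of len(memory_bits) cells using a bus that
    carries `bus_width` bits per machine cycle.

    Returns the filled tags register and the number of machine cycles.
    Assumption: the control block selects consecutive groups of cells.
    """
    tags = [0] * len(memory_bits)
    cycles = 0
    for start in range(0, len(memory_bits), bus_width):
        # Control block: select the cells that receive this cycle's bits.
        chunk = memory_bits[start:start + bus_width]
        tags[start:start + len(chunk)] = chunk
        cycles += 1
    return tags, cycles


# An 8192-cell tags register fed by a hypothetical 1024-bit bus.
column_bits = [i % 2 for i in range(8192)]
tags_register, machine_cycles = fill_tags_over_bus(column_bits, 1024)
```

With these illustrative numbers the register is filled in 8 machine cycles, after which a single write operation cycle transfers all 8192 bits to the target column. A narrower bus trades more machine cycles per column for a lower peak transfer width.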
A data processing device of the present invention includes, in addition to the associative processor, a memory, preferably a random access memory, for storing data to be processed and a bus for exchanging data between the memory and the associative processor. The associative processor includes an input/output buffer, for storing data that is exchanged between the associative processor and the memory via the bus. This buffer includes as many buffer cells as there are rows in each array of CAM cells. As noted above, the bus exchanges fewer bits at one time between the memory and the buffer than there are buffer cells in the buffer. A control block is provided to direct bits, that are transferred together from the memory to the associative processor, to the correct subset of the buffer cells, and to designate the correct subset of the buffer cells from which to transfer bits collectively to the memory. In one preferred embodiment of the data processing device of the present invention, one of the tags registers is used as the input/output buffer, as in U.S. Pat. No. 6,195,738. In another preferred embodiment of the data processing device of the present invention, the input/output buffer is one of the columns of CAM cells.
As many bits as there are rows of CAM cells in the associative processor are exchanged between the buffer and a target column of the associative processor in one elementary operation (compare or write) cycle. This is much faster than the one data element per elementary operation cycle of the prior art serial input/output method. This enhanced speed enables yet another aspect of the present invention. Because the rows of the CAM cell arrays of the present invention typically are shorter than the rows of prior art CAM cell arrays, an arithmetical operation executed on one of the CAM cell arrays may produce columns of intermediate results that leave insufficient room in the CAM cell array for the execution of subsequent arithmetical operations. These columns of intermediate results are written to the random access memory, via the input/output buffer, for temporary off-line storage, with one column of intermediate results being written in one machine cycle. As described above in the context of the parallelization of input and output, the number of machine cycles needed to transfer a column of intermediate results from the input/output buffer to the random access memory, or vice versa, depends on the bandwidth of the bus that connects the input/output buffer to the random access memory. When these columns of intermediate results are again needed, they are retrieved from the random access memory, also via the input/output buffer.