The present invention relates generally to matrix processors and more specifically to matrix processors for images.
Modern image processing frequently requires that a massive number of repetitive computations be performed on a single array of picture elements (pixels) before they are displayed. The purpose of these computations is to enhance the contents of a single picture frame, or to modify them to suit a specific image analysis objective. The two most common classes of pixel operations are pixel-point operations and pixel-group operations. Pixel-point operations are considerably less computation-intensive than pixel-group operations; they usually call for simple arithmetic or logic operations to be performed on each pixel of the frame. Pixel-group operations, on the other hand, usually involve a square or rectangular window (convolution kernel) of neighboring pixels upon which a set of arithmetic operations is performed. This set of operations is repeated for each pixel of the frame except the border pixels.
The value of a pixel usually indicates the intensity of a specific RGB video component in chromatic applications, or a gray-scale value in monochromatic applications (such as radar signal processing or image recognition).
It is the pixel-group operations that have, to date, established the limits of real-time image processing speeds. The most common pixel-group operation in image processing is the so-called spatial convolution, i.e. the process of multiplying selected neighboring pixels by a set of values called a convolution coefficient kernel and then summing the results. For instance, in a typical application a set of 9 pixels is arranged as a 3×3 matrix:

$$
\begin{bmatrix}
P(x-1,y-1) & P(x,y-1) & P(x+1,y-1)\\
P(x-1,y) & P(x,y) & P(x+1,y)\\
P(x-1,y+1) & P(x,y+1) & P(x+1,y+1)
\end{bmatrix}
$$

where x and y indicate the pixel's coordinates in the frame array. The K-bit value of each pixel in this 3×3 array (K usually lies in the 6-12 bit range) is then multiplied by the M-bit coefficient value (M is typically 6-16 bits) in the corresponding position of the 3×3 coefficient kernel:

$$
\begin{bmatrix}
A & B & C\\
D & E & F\\
G & H & I
\end{bmatrix}
$$

This matrix is called the convolution kernel, and each of its elements is referred to as a convolution coefficient. Finally, after all 9 multiplications are completed, their results are summed to yield the new value of the pixel at location (x, y):

$$
\begin{aligned}
NEWP(x,y) ={}& A\,P(x-1,y-1) + B\,P(x,y-1) + C\,P(x+1,y-1)\\
&+ D\,P(x-1,y) + E\,P(x,y) + F\,P(x+1,y)\\
&+ G\,P(x-1,y+1) + H\,P(x,y+1) + I\,P(x+1,y+1)
\end{aligned}
$$
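The NEWP sum can be evaluated directly for every non-border pixel. The following is a minimal Python sketch, assuming the frame is stored as a list of rows indexed `image[y][x]` (so `image[y][x]` is P(x, y)) and the kernel as `[[A, B, C], [D, E, F], [G, H, I]]`; the names `newp` and `convolve_frame` are illustrative only.

```python
def newp(image, kernel, x, y):
    """NEWP(x, y) for a 3x3 kernel [[A, B, C], [D, E, F], [G, H, I]].

    image is a list of rows, so image[y][x] is the pixel P(x, y).
    Kernel element kernel[j + 1][i + 1] multiplies P(x + i, y + j).
    """
    return sum(kernel[j + 1][i + 1] * image[y + j][x + i]
               for j in (-1, 0, 1) for i in (-1, 0, 1))


def convolve_frame(image, kernel):
    """Apply newp to every non-border pixel; border pixels are copied unchanged."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = newp(image, kernel, x, y)
    return out
```

Each interior pixel costs 9 multiplications and 8 additions, which is the workload the following paragraphs count per frame.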
The new value of the pixel, NEWP(x,y), usually corresponds to a modified value of its amplitude (gray scale). The entire process described above is called "2-D spatial filtration" or "2-D spatial convolution" and corresponds to a discrete two-dimensional filtration of the image in the spatial domain. Depending on the set of values A through I chosen for the convolution kernel, a number of image processing functions can be accomplished, in particular image smoothing, edge detection or extraction, and contrast enhancement. If all kernel coefficients except the middle one (E) are equal to each other, or all nine of them are equal, the kernel is referred to as symmetrical. This is the most common form of kernel used in modern image processing. If three or more coefficients in the kernel differ from each other, the kernel is called asymmetrical.
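The symmetrical case matters because such a kernel is fully described by two numbers: the common coefficient and the center coefficient. A minimal Python sketch of that test, assuming odd kernel dimensions; the function name `symmetric_coefficients` is hypothetical, and treating every non-symmetrical kernel as asymmetrical is an assumption of this sketch.

```python
def symmetric_coefficients(kernel):
    """Return (c1, c2) if every coefficient except the center equals c1
    (the center c2 may or may not equal c1); otherwise return None.

    Assumes odd kernel dimensions. Kernels that fail the test are treated
    as asymmetrical here (an assumption of this sketch).
    """
    rows, cols = len(kernel), len(kernel[0])
    cy, cx = rows // 2, cols // 2
    center = kernel[cy][cx]
    others = {kernel[y][x]
              for y in range(rows) for x in range(cols)
              if (y, x) != (cy, cx)}
    if len(others) == 1:
        return others.pop(), center
    return None
```

For instance, a smoothing kernel of all ones yields `(1, 1)`, while a kernel such as `[[1, 2, 1], [2, 4, 2], [1, 2, 1]]` yields `None`.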
Although the use of larger kernels, such as 5×5, 7×7 or even 15×15, is even more desirable (since it increases the bandwidth of convolution), the amount of computation involved in convolution with such large kernels is often prohibitive for most applications. Since industry-standard frame array sizes vary from 256×256 to 4096×4096 pixels, the number of multiplications and additions that must be performed for a single-frame convolution with a 3×3 kernel varies from approximately 600,000 to roughly 150 million per frame. Consequently, if frames are to be convolved in real time (i.e. processed at the same rate as, or faster than, they are acquired and digitized), the total frame convolution time places an obvious restriction on the image acquisition rate. For example, if an industry-standard medium-resolution image of 512×512 pixels is to be acquired and convolved using a 3×3 kernel, about 2.4 million multiply-accumulate operations must be performed. If a single eight-bit multiply/accumulate operation is assumed to require 50 nanoseconds using an off-the-shelf multiplier-accumulator (MAC), the total time required to complete a single-frame convolution will be about 0.12 seconds. This, in turn, limits the maximum frame repetition rate to about 8 Hz, a rate too slow for most industrial and commercial applications, which typically require a frame repetition rate of at least 30 Hz.
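The operation counts and timing above follow from simple arithmetic. A minimal Python sketch, assuming one multiply-accumulate per kernel coefficient per pixel, border pixels included for simplicity, and the 50 ns MAC time quoted above; `frame_cost` is a hypothetical helper name.

```python
def frame_cost(n, k=3, mac_ns=50.0):
    """Cost of convolving an n x n frame with a k x k kernel on a single MAC.

    Assumes one multiply-accumulate per kernel coefficient per pixel and
    ignores the small saving from skipping border pixels.
    Returns (MAC operations, seconds per frame, maximum frame rate in Hz).
    """
    macs = n * n * k * k
    seconds = macs * mac_ns * 1e-9
    return macs, seconds, 1.0 / seconds


for n in (256, 512, 4096):
    macs, seconds, hz = frame_cost(n)
    print(f"{n}x{n}: {macs:,} MACs, {seconds:.4f} s/frame, {hz:.2f} Hz max")
```

For the 512×512 case this gives 2,359,296 operations, about 0.118 s per frame, and a ceiling of roughly 8.5 frames per second on one MAC.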
Consequently, most modern image processors do not offer real-time 3×3 convolution capabilities. Yet in most industrial and military applications, frame repetition rates range from 30 Hz (interlaced NTSC standard) to as fast as 400 Hz. This implies that, for 512×512 pixel arrays, total frame convolution times of a few milliseconds (about 33 ms at 30 Hz, down to 2.5 ms at 400 Hz) are needed. In practice, such convolution speeds have rarely been achieved, and then only in board-level designs; ECL-based designs can meet such requirements.
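As a rough check on these figures, assume a single MAC unit must perform all N²k² products of one frame (border pixels included) within one frame period 1/f. The time budget per operation is then:

$$
t_{\mathrm{MAC}} \le \frac{1}{f\,N^2 k^2}, \qquad N = 512,\ k = 3 \;\Rightarrow\; N^2 k^2 = 2{,}359{,}296
$$

$$
f = 30\ \mathrm{Hz}:\; t_{\mathrm{MAC}} \le \frac{1}{30 \times 2{,}359{,}296}\ \mathrm{s} \approx 14.1\ \mathrm{ns}, \qquad
f = 400\ \mathrm{Hz}:\; t_{\mathrm{MAC}} \le \frac{1}{400 \times 2{,}359{,}296}\ \mathrm{s} \approx 1.06\ \mathrm{ns}
$$

Measured against the 50 ns MAC assumed earlier, this corresponds to roughly 4 to 48 MAC units operating in parallel.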
Image processing chips that claim to perform near-real-time convolution require large amounts of external circuitry to resequence the pixels before they are sampled by the processor. An example is shown in FIG. 1, which illustrates how ZORAN's ZR33481-20 DFP is used to accomplish 3×3 convolution. Notice that the pixels must be heavily buffered externally while a Sequencer controls the order in which the DFP receives the "shuffled" image data.
Thus, it is an object of the present invention to provide a matrix processor capable of real-time processing.
Another object of the present invention is to provide a matrix processor which can perform real-time convolution with 3×3, 5×5 and other kernels.
Still a further object of the present invention is to provide an image processor capable of performing kernel convolution on matrices ranging from 256×256 to multiples of that size.
An even further object of the present invention is to provide a matrix processing module which may be configured to perform real-time symmetrical and asymmetrical kernel operations on matrices.
A still further object of the present invention is to provide an image processor architecture which is capable of real-time kernel convolution as well as other image enhancement or processing operations.
A still even further object of the present invention is to provide a real-time image processing chip without the need for external buffering and data shuffling.
These and other objects are achieved by a matrix processor module that includes data inputs receiving in sequence P words from a matrix, coefficient inputs receiving coefficients for matrix multiplication, cascade inputs receiving summing information when the module is cascaded with other modules, and an output. The module includes a plurality of multipliers connected in parallel to the data inputs, either directly or through an ALU, and each connected to a respective coefficient input for producing products PC. A summer is selectively connected to the multipliers through selected delays for providing a sum of the inputs received from the multipliers. A plurality of FIFO storage elements are selectively connected to the summer, the cascade input and one of the multipliers for storing M words. An adder is connected to the summer and the FIFO storage for adding the inputs from the summer and the FIFO storage to provide an output PC. A control is connected to the summer and the FIFO storage for configuring the module either in a first configuration, which performs symmetrical kernel convolution, or in a second configuration, which is capable of performing asymmetrical kernel convolution.
In the symmetrical kernel operation, only two coefficients are needed: C1, the coefficient shared by every kernel position except the center, and C2, the center coefficient. Irrespective of the kernel size Q×R, only two multipliers and two coefficients are needed. The first multiplier multiplies the common coefficient C1 by the incoming word P and provides the product to the summer, which is configured as a shifting accumulator of Q products, Q being the number of products in a kernel row, so that the accumulator value is $P_i C_1 + P_{i-1} C_1 + \cdots + P_{i-Q+1} C_1$. The summer is converted into a shifting accumulator by two delays: a one-cycle delay that feeds the output of the summer back to its input, and a Q-cycle delay on the product path from the multiplier, whose delayed product is subtracted from the value of the summer. The output of the summer is connected to R-1 FIFO storage elements, each of which stores a respective row of sums and has a capacity of M, the number of words in a row of the matrix being convolved. The second multiplier multiplies the input P by the difference of the two coefficients to produce the product P(C2 - C1), which is stored in an additional, Rth FIFO storage element. Adding the outputs of the R FIFO storage elements to the current output of the summer at the adder produces the kernel convolution PC at the output. The configuration control uses a plurality of multiplexers to select the appropriate routing of the information. For example, for a 3×3 kernel convolution of a 256×256 matrix, there will be three FIFOs, one for the product of P and the difference of the two coefficients and two for the row sums of the common-coefficient products, all with a capacity of 256 words. The summer would accumulate 3 products, and the subtracted product would be delayed by 3 cycles.
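The symmetrical configuration rests on the identity NEWP = C1 × (sum of the Q×R window) + (C2 - C1) × P(center). The Python sketch below simulates that data flow cycle by cycle. It assumes Q and R are odd, words arrive in raster order, and the R-1 row FIFOs are cascaded so that each adds one more row of delay. It also assumes the difference product is delayed by (R-1)/2 rows plus (Q-1)/2 words so that it lines up with the window center; for a 3×3 kernel that is the M-word FIFO plus one extra word, and that extra word is an assumption not stated in the text. The function name `symmetric_stream_convolve` is hypothetical.

```python
from collections import deque


def symmetric_stream_convolve(image, c1, c2, q=3, r=3):
    """Cycle-by-cycle model of the symmetrical configuration (sketch).

    image: list of rows, each of length m, streamed in raster order.
    Kernel: q words per row, r rows; every coefficient is c1 except the
    center, which is c2. q and r are assumed odd.
    Returns {(row, col): NEWP} for the non-border pixels only.
    """
    m = len(image[0])
    acc = 0                              # shifting accumulator (the summer)
    delayed = deque([0] * q)             # q-cycle delay for the subtracted product
    # r-1 cascaded row FIFOs, each m words deep, fed by the summer output.
    row_fifos = [deque([0] * m) for _ in range(r - 1)]
    # ASSUMPTION: align P*(c2 - c1) with the window center.
    center_delay = ((r - 1) // 2) * m + (q - 1) // 2
    diff_fifo = deque([0] * center_delay)
    out = {}
    for t, p in enumerate(v for line in image for v in line):
        prod = c1 * p                    # first multiplier
        acc += prod - delayed.popleft()  # add newest product, drop the one from q cycles ago
        delayed.append(prod)
        total, carry = acc, acc
        for fifo in row_fifos:           # each FIFO adds one more row of delay
            fifo.append(carry)
            carry = fifo.popleft()
            total += carry
        diff_fifo.append((c2 - c1) * p)  # second multiplier
        total += diff_fifo.popleft()     # final adder
        row, col = divmod(t, m)
        # Keep only positions where the window lies wholly inside the frame;
        # elsewhere the accumulator window wraps across a row boundary.
        if row >= r - 1 and col >= q - 1:
            out[(row - (r - 1) // 2, col - (q - 1) // 2)] = total
    return out
```

For the 3×3 frame `[[0, 1, 2], [3, 4, 5], [6, 7, 8]]` with C1 = 1 and C2 = 2, the only interior pixel is (1, 1), and the result is 1·36 + (2 - 1)·4 = 40. Only two multiplications are performed per incoming word, whatever the kernel size.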
For asymmetrical kernel convolution, the matrix processor module would be configured with Q multipliers, each connected in parallel to the input to multiply the input word by a respective coefficient C1 through CQ. The outputs of the multipliers would be connected to the summer through delays that increase by one cycle per multiplier. A single FIFO storage element, with a storage capacity of M, is connected to the cascade input to store the output of a preceding module. The adder adds the output of the FIFO to the output of the summer to produce the sum for that module. In the general scheme there would be one matrix processor module for each row of the kernel. For example, for a 3×3 kernel applied to a 256×256 matrix, each FIFO storage element would have a capacity of 256 words, each module would have three multipliers receiving the three coefficients of its kernel row, and there would be three cascaded matrix processor modules.
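The cascade of row modules can be modeled in the same cycle-by-cycle way. In this Python sketch, each hypothetical `RowModule` holds one kernel row; it assumes multiplier j sees the input delayed by Q-1-j cycles, so the first coefficient meets the oldest (leftmost) word. That tap ordering is an assumption chosen to reproduce the NEWP formula. The first module's cascade input is fed with zero, and the last module's output is the convolution.

```python
from collections import deque


class RowModule:
    """One matrix processor module in the asymmetrical configuration (sketch).

    Hypothetical model: q multipliers share the input word; multiplier j sees
    the word delayed by q-1-j cycles. An m-word FIFO delays the cascade input
    by one matrix row before the adder.
    """

    def __init__(self, coeffs, m):
        self.coeffs = list(coeffs)
        q = len(self.coeffs)
        self.taps = deque([0] * q, maxlen=q)  # tap delays q-1, ..., 1, 0
        self.fifo = deque([0] * m)            # M-word FIFO on the cascade input

    def step(self, p, cascade_in):
        self.taps.append(p)                   # oldest word drops off automatically
        row_sum = sum(c * w for c, w in zip(self.coeffs, self.taps))  # summer
        self.fifo.append(cascade_in)
        return row_sum + self.fifo.popleft()  # adder: summer + delayed cascade


def asymmetric_stream_convolve(image, kernel):
    """Cascade one RowModule per kernel row; return {(row, col): NEWP} for interior pixels."""
    m = len(image[0])
    r, q = len(kernel), len(kernel[0])
    modules = [RowModule(krow, m) for krow in kernel]
    out = {}
    for t, p in enumerate(v for line in image for v in line):
        cascade = 0                           # the first module's cascade input is unused
        for module in modules:                # every module sees the same input word
            cascade = module.step(p, cascade)
        row, col = divmod(t, m)
        if row >= r - 1 and col >= q - 1:
            out[(row - (r - 1) // 2, col - (q - 1) // 2)] = cascade
    return out
```

With the kernel `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` on the frame `[[0, 1, 2], [3, 4, 5], [6, 7, 8]]`, the single interior output is 240, the same value the direct NEWP sum gives.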
For a 5×5 kernel, there would be five multipliers for the five words in a kernel row and five modules representing the five rows. Alternatively, three-multiplier modules may be used, wherein each row would include two three-multiplier modules having their inputs connected in parallel to the input and arranged in columns. The cascade input of each module would be connected to the output of the module for the preceding row within the same column. A final adder would be provided to add the outputs of the two columns. The unused multiplier (one of the six available per row in this example) would have a coefficient of zero. The delays from the multipliers to the summer would be 0, 1 and 2 cycles for a 3×3 kernel and 0, 1, 2, 3 and 4 cycles for a 5×5 kernel. Since the outputs are cascaded through the FIFOs, an appropriate 256-word delay would be provided from each row to the next.
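The two-column arrangement can be checked with a small model. The text does not spell out how the second column's taps are offset. This Python sketch assumes the second column's delay line is three words longer, so its multipliers see delays of 3, 4 and 5 cycles, with the 5-cycle tap carrying the zero coefficient. Together with delays 0, 1 and 2 in the first column, this covers the 0 to 4 cycle delays of a 5×5 row. `Module3` and `convolve_5x5_two_columns` are hypothetical names.

```python
from collections import deque


class Module3:
    """Hypothetical three-multiplier module with an m-word cascade FIFO.

    extra_delay lengthens the input delay line, shifting all three taps
    (an assumption used for the second column).
    """

    def __init__(self, coeffs, m, extra_delay=0):
        assert len(coeffs) == 3
        self.coeffs = list(coeffs)
        size = 3 + extra_delay
        self.line = deque([0] * size, maxlen=size)
        self.fifo = deque([0] * m)

    def step(self, p, cascade_in):
        self.line.append(p)
        taps = list(self.line)[:3]           # the three oldest words in the delay line
        row_sum = sum(c * w for c, w in zip(self.coeffs, taps))
        self.fifo.append(cascade_in)
        return row_sum + self.fifo.popleft()


def convolve_5x5_two_columns(image, kernel):
    """5x5 kernel built from two columns of three-multiplier modules (sketch)."""
    m = len(image[0])
    # Column A: coefficients k2, k3, k4 at delays 2, 1, 0.
    col_a = [Module3(list(krow[2:5]), m) for krow in kernel]
    # Column B: coefficients 0, k0, k1 at delays 5, 4, 3 (unused multiplier = 0).
    col_b = [Module3([0] + list(krow[0:2]), m, extra_delay=3) for krow in kernel]
    out = {}
    for t, p in enumerate(v for line in image for v in line):
        ca = cb = 0
        for mod_a, mod_b in zip(col_a, col_b):
            ca = mod_a.step(p, ca)
            cb = mod_b.step(p, cb)
        row, col = divmod(t, m)
        if row >= 4 and col >= 4:
            out[(row - 2, col - 2)] = ca + cb  # final adder sums the two columns
    return out
```

Because convolution is linear, splitting each kernel row across two modules and adding the two column outputs gives the same result as a single five-multiplier module per row.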
If the matrix to be convolved is a multiple of 256×256 in size, each row would include more than one module connected in cascade, wherein the coefficient inputs of the extra modules in each row are zero. Thus, for a 1024×1024 matrix, each row would include two matrix processor modules, wherein the second module would have zero coefficients and the outputs of the first module would be stored in the 256-word FIFO of the second module, such that the output of the second module would be delayed by 1024 cycles before being cascaded into the first module of the next row. Every row except the last would contain the multiple modules.
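A module with all-zero coefficients contributes nothing to the sum and acts only as a delay. The hypothetical Python helper below measures the delay of a chain of such FIFOs by tagging each word with its arrival cycle. It assumes each module adds exactly one FIFO of `depth` words and nothing else.

```python
from collections import deque


def chained_fifo_delay(num_fifos, depth=256):
    """Delay, in cycles, through num_fifos cascaded FIFOs of `depth` words.

    Hypothetical model: each zero-coefficient module contributes only its FIFO.
    """
    fifos = [deque([None] * depth) for _ in range(num_fifos)]
    t = 0
    while True:
        word = t                       # tag each input word with its arrival cycle
        for fifo in fifos:
            fifo.append(word)
            word = fifo.popleft()
        if word is not None:           # first real word has emerged from the chain
            return t - word
        t += 1
```

Under this model the delays simply add, so the row delay grows by `depth` cycles for every FIFO added to the chain.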
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.