The present invention relates generally to matrix processors and more specifically to matrix processors of images.
Modern image processing frequently requires that a massive number of repetitive computations be performed on a single array of picture elements (pixels) before they are displayed. The purpose of these computations is to enhance the contents of a single picture frame, or modify them to suit a specific image analysis objective. The two most common classes of pixel operations are the pixel-point operations and pixel-group operations. Pixel-point operations are considerably less computation-intensive than the pixel-group operations; they usually call for simple arithmetic or logic operations to be performed on each pixel of the frame. On the other hand, pixel-group operations usually involve a square or rectangular window (convolution kernel) of neighboring pixels upon which a set of arithmetic operations is to be performed. This set of operations is repeated for each pixel of the frame.
The value of a pixel usually indicates the intensity of a specific RGB video component in chromatic applications, or a gray-scale value in monochromatic applications (such as radar signal processing or image recognition).
It is the pixel-group operations that have, to date, established the limits of real-time image processing speeds. The most common pixel-group operation in image processing is so-called spatial convolution, i.e. a process of multiplying selected neighboring pixels by a set of values called a convolution coefficient kernel, followed by summation of the results. For instance, in a typical application, a set of, say, 9 pixels is arranged as a 3×3 matrix: ##EQU1## where x and y indicate the pixel's coordinates in the frame array. The K-bit value of each pixel of such a 3×3 array (where K usually lies in the 6-12 bit range) is then multiplied by the corresponding M-bit coefficient value (where M is typically 6-16 bits) from the 3×3 coefficient kernel: ##EQU2## The matrix above is called the convolution kernel, and each of its elements is referred to as a convolution coefficient. Finally, after all 9 multiplications are completed, their results are summed to yield the new value of the pixel at location x,y: ##EQU3##
The new value of the pixel NEWP(x,y) usually corresponds to the modified value of its amplitude (gray scale). The entire process described above is called "2-D spatial filtration" or "2-D spatial convolution" and corresponds to a discrete two-dimensional filtration of the image in the time domain. Depending on the set of values A through I picked for the convolution kernel, a number of image processing functions can be accomplished, in particular image smoothing, edge detection or extraction, and contrast enhancement. If all kernel coefficients except for the middle one (E) are picked equal to each other, or all nine of them are equal, the kernel is referred to as symmetrical. This is a common form of the kernels used in modern image processing. If three or more coefficients in the kernel differ from each other, the kernel is called asymmetrical.
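The multiply-and-sum step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the equations are not reproduced in this text, so pairing coefficient A with the upper-left neighbor and I with the lower-right neighbor is assumed, as is leaving border pixels unchanged. The function names and the sample kernels are hypothetical.

```python
def convolve_3x3(frame, kernel):
    """Return a copy of frame with every interior pixel replaced by NEWP(x, y).

    frame  : list of rows; frame[y][x] is the pixel value P(x, y)
    kernel : [[A, B, C], [D, E, F], [G, H, I]]
    Assumption: A multiplies the upper-left neighbor, I the lower-right one.
    Assumption: border pixels are copied unchanged (the text does not say).
    """
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # one of the nine multiplications per output pixel
                    acc += kernel[dy + 1][dx + 1] * frame[y + dy][x + dx]
            out[y][x] = acc
    return out


def is_symmetrical(kernel):
    """True when all coefficients other than the middle one (E) are equal."""
    outer = [kernel[r][c] for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    return len(set(outer)) == 1


# Illustrative kernels only; the coefficient values are not from the text.
SMOOTH = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
```

On a flat 4×4 frame of value 5, `convolve_3x3(frame, SMOOTH)` gives 45 at each interior pixel and `convolve_3x3(frame, EDGE)` gives 0 there; both sample kernels are symmetrical in the sense defined above.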
Although the use of larger kernels, such as 5×5, 7×7 or even 15×15, is even more desirable (since it increases the bandwidth of convolution), the amount of computation involved in convolution with such large kernels is often prohibitive for most applications. Since industry standard frame array sizes vary from 256×256 pixels to 4096×4096, the number of multiplications and additions which have to be performed for a single 3×3 frame convolution varies from approximately 600,000 to approximately 150 million per frame. Consequently, if the frames are to be convolved in real time (i.e. processed at the same rate as, or faster than, they are acquired and digitized), the total frame convolution time puts an obvious restriction on the image acquisition time. Thus, for example, if the industry standard medium resolution image of 512×512 is to be acquired and convolved using a 3×3 kernel, almost 2.5 million multiplications and additions will have to be performed. If a single eight-bit multiply/accumulate operation is assumed to require 50 nanoseconds using an off-the-shelf multiplier-accumulator (MAC), the total amount of time required to complete a single frame convolution will be 0.125 seconds. This, in turn, would imply that the maximum frame repetition rate would be limited to 8 Hz, a rate too slow for most industrial and commercial applications, which typically require at least a 30 Hz frame repetition rate.
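The operation counts and timing above can be checked with a short calculation. The sketch below assumes one multiply/accumulate per kernel coefficient per pixel and a single MAC working sequentially; the function names are hypothetical. The exact 3×3 counts are 589,824 for 256×256, 2,359,296 for 512×512 and 150,994,944 for 4096×4096, which the text rounds.

```python
def macs_per_frame(width, height, kernel_rows=3, kernel_cols=3):
    """Multiply/accumulate operations needed to convolve one frame."""
    return width * height * kernel_rows * kernel_cols


def frame_time_ns(width, height, mac_ns=50, kernel_rows=3, kernel_cols=3):
    """Time for one frame with a single sequential MAC, in nanoseconds."""
    return macs_per_frame(width, height, kernel_rows, kernel_cols) * mac_ns


def max_frame_rate_hz(width, height, mac_ns=50, kernel_rows=3, kernel_cols=3):
    """Highest frame repetition rate that one sequential MAC can sustain."""
    return 1e9 / frame_time_ns(width, height, mac_ns, kernel_rows, kernel_cols)
```

For a 512×512 frame at 50 ns per operation this gives about 0.118 seconds per frame, or roughly 8.5 Hz, consistent with the rounded 0.125 seconds and 8 Hz quoted in the text and far below a 30 Hz requirement.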
Consequently, most modern processors do not offer real-time 3×3 convolution capabilities. On the other hand, in most industrial and military applications, the frame repetition rates vary from 30 Hz (interlaced NTSC standard) to as fast as 400 Hz. This implies that for 512×512 pixel arrays total frame convolution times in the millisecond range are needed. In practice, such convolution speeds have rarely been accomplished and only in board-level designs. ECL-based designs can meet such requirements.
Thus, it is an object of the present invention to provide a matrix processor capable of real-time processing.
Another object of the present invention is to provide a matrix processor which can do real-time convolution of 3×3, 5×5 and other kernels.
Still a further object of the present invention is to provide an image processor capable of doing kernel convolutions of matrices ranging from an N×M array to multiples of that array.
These and other objects are obtained by a system for Q×R kernel convolution of an M×N array of words. The system includes an input receiving the words P from the M×N array and coefficient storage for storing the coefficients C_1 through C_{Q×R}. Each multiplier in a matrix of Q×R multipliers has a first input for receiving a word Pi and a second input for receiving a dedicated coefficient Cj, and produces a product PiCj. An adder is connected to the outputs of the plurality of multipliers for adding the products to provide an output convolution at a first or product output.
A plurality of buffers connect the word inputs to respective multipliers for storing one or more words and delaying the input of the word to a respective multiplier as a function of its position in the Q×R matrix and the row length M of the M×N array. The input structure is connected to the first multiplier of the row without a buffer. The multipliers in the matrix have a row input with the first row being connected to the input structure and the other rows' inputs are connected to the row input of a preceding row by a row buffer having M stages. Each of the multipliers within a specific row is connected to the row input by a buffer having a stage for each position it is spaced from the first position. Alternatively, the row buffer would have M-R+1 stages and each of the multipliers in a row would be connected to the preceding multiplier by a single stage buffer.
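The effect of these buffers can be illustrated with a small software model. The sketch below assumes that Q counts kernel rows and R counts kernel columns, that pixels arrive one word per step in row-major order, and that the delay line starts filled with zeros; the function name is hypothetical. It models only the total delay seen by each multiplier, which is the same in both arrangements: with M-stage row buffers the multiplier at row r, column c sees the input delayed by r·M + c words, and in the alternative arrangement the R-1 single stages plus a row buffer of M-R+1 stages again add up to M per row.

```python
from collections import deque


def stream_convolve(pixels, row_length, coeffs):
    """Model the buffered multiplier matrix on a row-major pixel stream.

    pixels     : flat sequence of words, one per step
    row_length : M, the number of words per row of the input array
    coeffs     : Q rows by R columns of coefficients (assumed layout)
    Returns one sum of products per input word.
    """
    q_rows, r_cols = len(coeffs), len(coeffs[0])
    depth = (q_rows - 1) * row_length + r_cols
    # line[k] holds the word that arrived k steps ago (zeros before start)
    line = deque([0] * depth, maxlen=depth)
    outputs = []
    for word in pixels:
        line.appendleft(word)  # the oldest word falls off the far end
        acc = 0
        for r in range(q_rows):
            for c in range(r_cols):
                # multiplier (r, c) sees the input delayed by r*M + c words
                acc += coeffs[r][c] * line[r * row_length + c]
        outputs.append(acc)
    return outputs
```

At step t = y·M + x, for x ≥ R-1 and y ≥ Q-1, the output equals the sum over r and c of `coeffs[r][c]` times pixel (x-c, y-r), so no frame memory beyond the delay line is needed. Changing `row_length` alone adapts the same multiplier structure to a different row length.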
In either embodiment, the row buffer is programmable for different values of M. This allows the system to handle word arrays of different sizes without modification of the multiplier structure. A cascade output as well as the product output may be provided representing the total delay or number of stages through the Q×R matrix. Depending upon the embodiment, it is either out of the last multiplier or out of the last row buffer.
The input may be through an arithmetic logic unit and through programmable stages of delay. A second input or cascade input is provided and is connected to the rows other than the first row through a multiplexer. The other input to the multiplexer is from the row buffers. The multiplexer would provide that the input word be connected to the first row with the other rows receiving either their inputs from the row buffers or from the cascade inputs. This allows versatility of the base kernel Q×R to be used in a structure for larger kernels or larger word arrays. A summer for each row R provides a sum of the products of its respective row to an output adder. The output adder provides the output convolution Pcout. The second or cascade input may also provide sums from other kernel stages to the output adder. A pair of coefficient stores are provided and a multiplexer chooses which store is being provided as an input to the multiplier matrix.
The system is converted to a (Q×R)×1 convolution by programming the row buffers to Q stages. Arrays larger than M×N may be handled by a single system for Q×R kernel convolution by providing additional external row buffers connected to the second inputs. Alternatively, a plurality of Q×R kernel convoluting systems may be cascaded. For kernel sizes other than Q×R, the basic Q×R convoluting system may be cascaded and selected coefficients set to zero.
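The one-dimensional mode can be seen by listing the delay each multiplier observes. The sketch below assumes the arrangement in which each row input is delayed from the previous row input by the programmed number of row-buffer stages, and a square base kernel (Q = R) so that "Q stages" equals the number of multipliers per row; the function name is hypothetical.

```python
def tap_delays(q_rows, r_cols, row_stages):
    """Delay, in words, seen by each multiplier of a q_rows x r_cols matrix.

    Assumes multiplier (r, c) is delayed by r * row_stages + c words.
    """
    return [r * row_stages + c for r in range(q_rows) for c in range(r_cols)]
```

With a 3×3 matrix, `tap_delays(3, 3, 512)` yields the delays 0, 1, 2, 512, 513, 514, 1024, 1025, 1026 of a two-dimensional window over 512-word rows, while `tap_delays(3, 3, 3)` yields 0 through 8, i.e. nine consecutive words of a single 9×1 window.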
Where the convolution kernel is larger than the convolver cell, the convolver cells may be arranged in a group of rows and columns. The cascade input of each convolver is connected to a product output of a previous convolver in its respective row. The cascade input of the first convolver of each row is connected to a product output of the last convolver of a previous row. The data input of each convolver is connected to a cascade output of a previous convolver in its respective column. The inputs of the convolvers of the first row are connected to the data input of the first convolver by appropriate delays. The delays may be external or part of the internal programmable input delay stages. The appropriate convolution kernel, as well as variation in the matrix of the input, can be controlled by supplying appropriate coefficients to the individual convolvers.
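The arithmetic behind such a grouping can be sketched as follows: a large kernel is zero-padded to a multiple of the cell size, split into cell-sized sub-kernels, and the partial sums of the cells are added with the matching row and column offsets. This models only the sums, not the cascade wiring or timing; the function names, the 3×3 default cell size and the window orientation (coefficient [i][j] paired with pixel (x+j, y+i)) are assumptions.

```python
def correlate_valid(frame, kernel):
    """Sum of kernel[i][j] * frame[y+i][x+j] for every fully covered window."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(frame), len(frame[0])
    return [[sum(kernel[i][j] * frame[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]


def tiled_correlate(frame, kernel, cell=3):
    """Same result as correlate_valid, built from cell x cell partial sums."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(frame), len(frame[0])
    tiles_y, tiles_x = -(-kh // cell), -(-kw // cell)  # ceiling division
    pad_y, pad_x = tiles_y * cell - kh, tiles_x * cell - kw
    # Unused coefficients are set to zero, e.g. a 5x5 kernel in a 6x6 grid.
    padded = [row + [0] * pad_x for row in kernel]
    padded += [[0] * (kw + pad_x) for _ in range(pad_y)]
    # Extend the frame with zeros so every cell has a full window to read.
    ext = [row + [0] * pad_x for row in frame]
    ext += [[0] * (w + pad_x) for _ in range(pad_y)]
    out = [[0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            sub = [r[tx * cell:(tx + 1) * cell]
                   for r in padded[ty * cell:(ty + 1) * cell]]
            part = correlate_valid(ext, sub)  # one convolver cell's output
            for y in range(len(out)):
                for x in range(len(out[0])):
                    # add the cell's partial sum at its offset in the window
                    out[y][x] += part[y + ty * cell][x + tx * cell]
    return out
```

For a 5×5 kernel this uses four 3×3 cells with eleven coefficients set to zero, and the summed result matches a direct 5×5 computation.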
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.