This invention relates to signal processors, and more specifically, to high speed integrated circuit signal processors.
In the past, special purpose computers have been used extensively for computationally intensive algorithms when effective solutions to real-time signal processing problems were required. As general purpose computer costs came down and speeds increased, the need for specially designed computers declined; but, of course, there are always bigger problems to be tackled. Custom integrated circuit processors, employed as peripheral co-processors, can serve the same needs. When designed and used properly, they dramatically increase the performance of general purpose computer systems for a large range of computationally intensive programs.
Examples of custom real-time signal processing chips include various embodiments for performing the direct and inverse Discrete Cosine Transform (DCT) with applications in image coding. Such a chip is described, for example, by M. Vetterli and A. Ligtenberg in "A Discrete Fourier-Cosine Transform Chip", IEEE Journal on Selected Areas in Communications, Vol. SAC-4, No. 1, January 1986, pp. 49-61, and in a copending application of S. Knauer, filed on June 19, 1986 and bearing the Ser. No. 876,076.
The DCT, which is an orthogonal transform, is useful in image coding because images contain a fair amount of redundancy and, therefore, it makes sense to process images in blocks. A two dimensional DCT of such blocks normally yields mostly low frequency components, and ignoring the high frequency (low magnitude) components does little damage to the quality of the encoded image. Although there is a potential transmission and storage benefit from applying the DCT to image encoding, the computational burden is quite heavy. To work at video sampling rates requires a processing speed of about 6.4 million samples per second (assuming 15734 scan-lines per second and about 400 samples per line as necessary for NTSC). An eight-point DCT requires at least 13 multiplications and 29 additions. A two dimensional transform can be calculated by applying a one dimensional transform on the rows followed by a one dimensional transform on the columns, so that each sample takes part in two eight-point transforms. Consequently, for real-time image processing a DCT integrated circuit is required to perform 1.6 million eight-point transforms per second involving about 20 million multiplications and 47 million additions.
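The row-then-column calculation and the throughput figures quoted above can be checked with a minimal Python sketch. The function names (dct_1d, dct_2d, dct_workload) are hypothetical, and the one dimensional transform is computed directly from the orthonormal DCT-II definition as an assumption made for clarity; it is not the fast 13-multiplication algorithm and not the circuit of this invention.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a sequence, computed directly from its definition.

    Assumption: orthonormal scaling is chosen for illustration only.
    """
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Two dimensional DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    # zip(*rows) walks the columns of the row-transformed block.
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [vertical][horizontal].
    return [list(r) for r in zip(*cols)]


def dct_workload(samples_per_second, n=8, mults=13, adds=29):
    """Transforms, multiplications and additions needed per second.

    Each sample takes part in one row transform and one column transform,
    hence the factor of two.
    """
    transforms = 2 * samples_per_second / n
    return transforms, transforms * mults, transforms * adds


if __name__ == "__main__":
    transforms, mult_rate, add_rate = dct_workload(6.4e6)
    print(f"{transforms / 1e6:.1f} million eight-point transforms per second")
    print(f"{mult_rate / 1e6:.1f} million multiplications per second")
    print(f"{add_rate / 1e6:.1f} million additions per second")
```

With the 6.4 million samples per second assumed in the text, the sketch gives 1.6 million transforms, 20.8 million multiplications and 46.4 million additions per second, in line with the rounded figures above.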
Another application for custom real-time orthogonal transform signal processors exists in systems that solve least-squares problems. Such problems are pervasive in signal processing and linear programming. For example, the most intensive computational aspect of a linear programming problem using Karmarkar's algorithm is actually the solution of a least-squares problem at each iteration. This could be solved with a co-processor consisting of an array of orthogonal transform processors (each consisting of four multipliers, one adder and one subtractor), and the use of such a co-processor would diminish the running time of Karmarkar's software embodiments by an order of magnitude.
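One operation that fits the stated resource count of four multipliers, one adder and one subtractor is a plane rotation of a pair of values. The Python sketch below assumes that interpretation, which the text above does not spell out, and shows how such a rotation can zero one element of a pair, the basic step in rotation-based least-squares solvers. The names rotate and rotation_coefficients are hypothetical.

```python
import math


def rotate(x, y, c, s):
    """Apply a plane rotation to the pair (x, y).

    Uses exactly four multiplications, one addition and one subtraction.
    """
    u = c * x + s * y  # two multipliers feeding the adder
    v = c * y - s * x  # two multipliers feeding the subtractor
    return u, v


def rotation_coefficients(a, b):
    """Coefficients (c, s) that rotate the pair (a, b) into (r, 0).

    Assumption: coefficients are computed in software ahead of the rotation.
    """
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate; use the identity
    return a / r, b / r


if __name__ == "__main__":
    c, s = rotation_coefficients(3.0, 4.0)
    print(rotate(3.0, 4.0, c, s))  # second element is driven to (about) zero
```

Because the rotation is orthogonal, it preserves the length of the pair, which is why a sequence of such rotations can reduce a least-squares problem to triangular form without changing its solution.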
To achieve the high performance required in many applications, as illustrated above in connection with DCT image coding, one has to answer two questions. First, what are the basic building blocks involved in these operations? Second, what are efficient VLSI structures for these building blocks? The answers to these two questions are strongly related, because without a good division of the algorithm one cannot obtain an efficient VLSI implementation, and without knowing how an efficient VLSI structure can be implemented, one cannot obtain optimal partitioning of the algorithm.
These difficulties are illustrated in U.S. Pat. No. 4,510,578 issued to Miyaguchi et al. on Apr. 9, 1985, where a circuit is described for subjecting an input signal to an orthogonal transform. The circuit comprises a first stage of three memories operating in parallel and feeding eight constant coefficient multipliers. The output signals of the multipliers are applied to three adders, two of which are three input adders. The three outputs of the first stage are applied to three secondary complex multiplier stages, and each of the three complex multiplier stages comprises two memories, four multipliers and two adders. No effort is made in the Miyaguchi et al. circuit to employ an architecture that is particularly fast but, rather, the accent seems to be on employing conventional modules (adders, multipliers) in a combination that achieves the desired transform in the most straightforward fashion.
In contrast, the considerations that must be borne in mind for a good integrated circuit realization relate not only to the number of multipliers and adders required but also to the size of those elements and the delays contributed by them. For example, in some embodiments adding and then multiplying increases the delay much more than multiplying and then adding.
It is an object of this invention to provide an orthogonal transform processor circuit that is particularly well adapted for integrated circuit realizations. It is another object of this invention to provide an orthogonal transform processor whose architecture permits maximum utilization of the speed capabilities of integrated circuits.