The present invention is generally directed to a system for processing digital data in which data are processed in portions that are smaller than the word size, the size of the portions being optimally selected to maximize throughput efficiency, as that term is defined herein. More particularly, the present invention is directed to digital signal processing systems which are neither fully parallel nor fully serial in their architectures, but rather exhibit an intermediate architecture selected on the basis of optimizing a measure of performance based upon speed and circuit size.
In fully parallel (or word-parallel) digital signal processing architectures, all bits of a data word, n in number, are processed simultaneously by the circuitry. This architecture has the advantage of relatively high processing speed, but suffers from the disadvantage that it replicates circuit elements and the interconnections between those elements for each bit of a word, and each replication tends to consume a commensurate additional amount of die area in a monolithic integrated circuit. Interconnections between monolithic integrated circuits for parallel data are multi-wire, and a considerable number of interconnection terminals or "pins" must be provided for each integrated circuit to implement those multi-wire connections.
On the other hand, fully serial digital signal processing architectures process one bit at a time in each clock cycle. These circuits have the advantage of simplicity, ease of design and, most importantly, they require minimal amounts of circuitry and so take up only a small amount of die area in a monolithic integrated circuit. Also single-wire interconnections between monolithic integrated circuits are made possible by serial digital signals, which is important when the restrictions upon the number of interconnection terminals or "pins" available for such connections are pressing. Within a monolithic integrated circuit the single-conductor interconnections between circuit elements tend to appropriate less chip area than the multi-conductor interconnections between elements that characterize fully parallel architectures.
Serial architectures also tend to exhibit a substantial amount of latency. That is, because of the serial design, a relatively large number of clock cycles can elapse between the time that an input bit is received and the time that output information related to the input bit is provided by the circuitry. However, circuit speed is generally sufficiently fast once the latency period has elapsed. Also, when a number of serial computations are to be performed in a data-flow pipeline, later computations can begin before earlier ones finish, which tends to reduce overall latency in the system. Accordingly, throughput is not so low as to preclude utilization of this architecture. The main advantage of serial computation is the need for only a small area for the processing elements and their electrical interconnections. The drawback, however, is that throughput is often lower than otherwise desired. Equivalent throughput can often be approached by more traditional non-pipelined Von Neumann architectures.
A widely used fully serial architecture employs bit-serial signals in which a serial stream of bits describes a succession of data words bit by bit, in order of increasing significance, where those data words represent two's complement numbers. This serial stream of data bits is accompanied by a signal indicating when one data word finishes and another commences, which signal can be a signal that is a ONE when the most significant bit of a data word occurs in the serial stream of data bits and that is otherwise a ZERO.
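The bit-serial format just described can be sketched in a few lines of Python. This is only an illustration: the function name `bit_serial_stream` and the representation of the stream as (bit, flag) pairs are assumptions made for the example, not part of any cited design.

```python
def bit_serial_stream(words, n):
    """Yield (data_bit, msb_flag) pairs for n-bit two's complement words.

    Bits are emitted in order of increasing significance. The flag is
    ONE only while the most significant bit of a word is on the line.
    (Hypothetical helper, for illustration only.)
    """
    for w in words:
        if not -(1 << (n - 1)) <= w < (1 << (n - 1)):
            raise ValueError("word does not fit in n bits")
        pattern = w & ((1 << n) - 1)  # two's complement bit pattern
        for i in range(n):
            yield (pattern >> i) & 1, 1 if i == n - 1 else 0


# Two 4-bit words, 3 (0011) and -2 (1110), sent least significant bit first.
stream = list(bit_serial_stream([3, -2], 4))
```

Here the flag column marks the fourth and eighth bit times, which is where each word ends and the next may begin.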
Data-flow pipeline architectures are recognized as being appropriate to the implementation of a large class of algorithms such as those that appear in digital signal processing applications. There have been two major approaches to data flow architecture, namely fully parallel and fully serial implementations. These architectures are discussed broadly above. Both of them have been studied extensively.
Many algorithms, especially in the areas of digital signal processing and graphics applications, have a constant throughput and can be performed with a constant latency. These algorithms are suitable for direct implementation in hardware using pipelined data-flow architectures. Unfortunately, many algorithms require more operations, and hence more individual operators than can be accommodated on a single very large scale integrated circuit (VLSI circuit) using fully parallel arithmetic or logic. On the other hand, bit-serial systems often do not provide a sufficiently high throughput. Furthermore, the structure of many algorithms makes it difficult to avoid these problems by decomposing the data processing so as to dispose different portions of the circuitry on separate integrated circuits.
Fully-parallel computational elements have been one of the main objects of study in computer arithmetic. Even with the advent of VLSI, fully-parallel computational elements are not well suited to data-flow architectural treatment, however, because their replicated digital hardware causes a tendency towards excessive size (as measured with respect to utilization of chip area). Furthermore, the multi-conductor interconnections within an integrated circuit are difficult to route unless the die size is allowed to be larger than one would wish.
Nonetheless, much work has been done on pipeline optimization for flow graphs of parallel computational elements. These aspects have been described in the works of Leiserson and others. These works include Digital Circuit Optimization by C. E. Leiserson, F. M. Rose and J. B. Saxe (MIT Report 1982), Optimizing Synchronous Systems by C. E. Leiserson and J. B. Saxe (Proceedings of the 22nd Annual Symposium on the Foundations of Computer Science, 1981), and in Models for VLSI Circuits by F. M. Rose (MIT Master's Thesis, 1982). Work on pipeline optimization for the flow graph organization of parallel computational elements is also described in the article Sehwa: A Program for Synthesis of Pipelines by Nohbyung Park and Alice Parker (IEEE Proceedings of the 23rd Design Automation Conference, 1986). Usually, however, parallel computational operators are used in a different architecture where they are time shared. Sharing of the operators decreases the throughput of the circuit, however. For example, see the article The VLSI Design Automation Assistant: Prototype System by T. J. Kowalski and D. E. Thomas (Proceedings of the 20th Design Automation Conference, June 1983, pages 479-483).
Bit-serial computational models have also received attention. In particular, Jackson et al. and later Lyon have proposed a methodology which has essentially been followed for the design of at least three "silicon compilers". In this regard, see An Approach to the Implementation of Digital Filters by Leland B. Jackson, James F. Kaiser and Henry S. McDonald (IEEE Transactions on Audio and Electroacoustics, Vol. AU-16, No. 3, September 1968, pages 413-421) and the article A Bit-Serial VLSI Architectural Methodology for Signal Processing by Richard F. Lyon (VLSI 81, Academic Press, 1981).
In connection with fully-parallel computation in data flow architectures, a technique known to designers (particularly those engaged in the design of digital filters) is to employ plural-path networks for "plural-phase" or "polyphase" data processing. See the M. G. Bellanger, G. Bonnerot and M. Coudreuse paper Digital Filtering by Polyphase Network: Application to Sample Rate Alteration and Filter Banks (IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 2, pages 109-114, April 1976). See also pages 79-98 of the R. E. Crochiere and L. R. Rabiner book Multirate Digital Signal Processing, copyright 1983 by Prentice-Hall, Inc., Englewood Cliffs, N.J. 07632. In plural-phase data processing a stream of digital words supplied at an original sample rate is considered to comprise a succession of cycles, each cycle containing a plurality, p in number, of successive words. The p words in each cycle are considered as separate phases of the cycle. These phases may be identified by the consecutive ordinal numbers zeroth through (p-1)th, assigned in accordance with the occurrence in the cycle of the words representative of those phases. Each word phase is used to form a separate sample stream, the sample rate of which is one-pth that of the original sample rate; and calculations are performed at the lower sample rate on each of the sample streams. The results of these plural-phase calculations are then combined to generate results at the original sampling rate. Plural-phase data processing permits a relatively high throughput rate for a system, while calculations can be performed at reduced rates.
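As a rough software illustration of plural-phase processing, the sketch below applies a finite-impulse-response filter by splitting both the input stream and the filter kernel into p phases, performing all multiply-accumulate work on the low-rate phase streams, and interleaving the p result streams back to the original rate. The function names (`fir_direct`, `split_phases`, `fir_polyphase`) are hypothetical, the input is assumed to be zero before its first sample, and the arrangement is a textbook-style decomposition rather than the specific networks of the cited references.

```python
def fir_direct(x, h):
    """Reference filter at the full rate: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def split_phases(seq, p):
    """Phase q holds every p-th word of seq, starting at word q."""
    return [seq[q::p] for q in range(p)]


def fir_polyphase(x, h, p):
    """Same filter, computed on p sample streams at 1/p of the rate."""
    if len(x) % p:
        raise ValueError("input length must be a whole number of cycles")
    xs = split_phases(x, p)   # input phases
    hs = split_phases(h, p)   # kernel phases
    cycles = len(x) // p
    ys = []
    for r in range(p):                      # one low-rate output stream per phase
        y_r = []
        for m in range(cycles):
            acc = 0
            for s in range(p):
                q = (r - s) % p             # input phase feeding this term
                delay = 1 if s > r else 0   # borrow from the previous cycle
                for j, coeff in enumerate(hs[s]):
                    i = m - j - delay
                    if i >= 0:
                        acc += coeff * xs[q][i]
            y_r.append(acc)
        ys.append(y_r)
    # Combine the phase results back into one stream at the original rate.
    return [ys[n % p][n // p] for n in range(len(x))]


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
h = [1, -2, 3, 4]
y = fir_polyphase(x, h, 3)
```

Each inner accumulation touches only one input phase and one kernel phase, so in hardware it could be clocked at the lower sample rate.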
Another technique that is used by digital circuit designers to slow the rates at which data processing needs to be done is a procedure known as "banking". An operator that is to process a stream of data at a higher throughput rate is simulated by processing segments of that data stream in parallel in a plurality, p in number, of operators operating at a lower throughput rate one-pth as fast as the higher throughput rate. Successive segments of the data stream are displaced one sample word from each other in the banking procedure. When banking is employed in transverse filtering, each segment of the data stream spans the number of sample words in the filter kernel. The same filter kernel weights each segment of data to determine each successive sample word of filter response, and the component filter responses generated in parallel at the lower throughput rate are then sequentially polled at the higher throughput rate to supply the complete filter response at that higher throughput rate.
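Banking as applied to transverse filtering can be sketched as follows. The names `transversal_direct` and `transversal_banked` are hypothetical, the input is assumed to be zero before its first sample, and the p operators are modeled as ordinary loops rather than concurrent hardware.

```python
def transversal_direct(x, kernel):
    """Reference: one operator producing every output word at the full rate."""
    return [sum(kernel[k] * x[n - k] for k in range(len(kernel)) if n - k >= 0)
            for n in range(len(x))]


def transversal_banked(x, kernel, p):
    """p operators, each producing every p-th output word."""
    span = len(kernel)
    padded = [0] * (span - 1) + list(x)      # assumed zero history before x[0]
    banks = []
    for b in range(p):
        out = []
        for n in range(b, len(x), p):        # bank b handles n = b, b+p, b+2p, ...
            segment = padded[n:n + span]     # x[n-span+1 .. n], one kernel span
            out.append(sum(w * s for w, s in zip(kernel, reversed(segment))))
        banks.append(out)
    # Poll the banks in turn to rebuild the response at the full rate.
    return [banks[n % p][n // p] for n in range(len(x))]


x = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
y = transversal_banked(x, kernel, 3)
```

Every bank applies the same kernel; only the segment of the data stream it sees differs, each segment being displaced by one sample word from the one handled by the neighboring bank.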
The present invention is particularly useful to those designers who employ software and hardware tools generally described as being "silicon compilers". These tools permit designers to specify arithmetic and logical functions in a relatively high level language, such as C or FORTRAN or a special hardware description language, and permit them to use the silicon compiler system to generate a set of masks which are employed in the fabrication of VLSI circuits that operate to carry out the function specified. For example, such silicon compilers are described in VLSI Signal Processing: A Bit-Serial Approach by Peter Denyer and David Renshaw (Addison-Wesley Publishing Company, Inc., Reading, Mass., 1985). Still other relevant material pertaining to silicon compilers may be found in Digit-Pipelined Arithmetic as Illustrated by the Paste-Up System: A Tutorial by Mary J. Irwin and Robert M. Owens (Computer, April 1987, pages 61-73). Other relevant material concerning silicon compilers may be found in the article Custom Design of a VLSI PCM-FDM Transmultiplexor from System Specification to Circuit Layout Using a Computer-Aided Design System by Rajeev Jain et al. (IEEE Journal of Solid-State Circuits, Volume SC-21, No. 1, February 1986, pages 73-85) and in the article A Bit-Serial Silicon Compiler by Jeffrey R. Jasica et al. (Proceedings of the International Conference on Computer-Aided Design, ICCAD-85, Santa Clara, Calif., pages 91-93, 1985).
S. G. Smith and P. B. Denyer in a paper titled Radix-4 Modules for High Performance Bit-Serial Computation (IEE Proceedings, Vol. 134, Pt. E, No. 6, Nov. 1987, pages 271-276) present an outline of a number of methods for increasing the throughput of bit-serial architectures. Among the methods mentioned therein is the pairing of bit-serial bits for parallel computation as radix-four digits. In this same regard, attention is also directed to the paper titled Techniques to Increase the Computational Throughput of Bit-Serial Architectures, by Smith et al. (Proceedings of ICASSP 87, page 543, April 1987).
The Smith and Denyer articles are interesting also in regard to the radix-four adders and multipliers they describe for processing dual-bit digits, which can be modified to accommodate multiple-bit digits. Digit-serial addition and subtraction for plural-bit digits are described by R. I. Hartley and P. F. Corbett in U.S. patent application Ser. No. 265,210 filed 31 Oct. 1988, entitled "DIGIT-SERIAL LINEAR COMBINING APPARATUS" and assigned to General Electric Company. That application describes structures for performing digit-serial comparison as well as programmed addition or subtraction, which structures can perform non-restoring division. Digit-serial multipliers suitable for plural-bit digits are known in the prior art. Such multipliers are also described by R. I. Hartley and P. F. Corbett in U.S. patent application Ser. No. 134,271 filed Aug. 15, 1988, entitled "BIT-SLICED DIGIT-SERIAL MULTIPLIER" and assigned to General Electric Company; and in U.S. patent application Ser. No. 231,937 filed Aug. 15, 1988, entitled "BIT-SLICED DIGIT-SERIAL MULTIPLIER", and assigned to General Electric Company.
Of interest is the Irwin and Owens article Digit-Pipelined Arithmetic as Illustrated by the Paste-Up System: A Tutorial (cited above) with regard to its description of architecture using two-bit-wide signed digits to describe each arithmetic word. Signed digits are used to permit the more significant digits of a word to be supplied first. Signed digits introduce undesirable redundancy into arithmetic words, inasmuch as each digit carries its own sign indication, rather than one bit in each arithmetic word providing sign indication for the entire word. The use of signed digits undesirably impairs "throughput efficiency", because handling the redundancy in the signed digits requires either an increase in digital hardware or a reduction in throughput rate as compared to non-redundant arithmetics. "Throughput efficiency" is a measure of the relative performance of integrated circuits, which measure includes as factors the throughput rate and the reciprocal of the area of the digital circuitry required to support a particular operation or set of operations, for a given set of integrated-circuit design rules.
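To make the "throughput efficiency" measure concrete, the following sketch computes it as throughput rate multiplied by the reciprocal of area, and evaluates it over several digit sizes using a toy cost model. The model is purely an assumption made for illustration: the linear dependence of clock period and of area on digit size m, and every numeric parameter (`t0`, `t1`, `a0`, `a1`), are hypothetical and are not taken from any measurement or from the cited references.

```python
def throughput_efficiency(throughput_rate, area):
    """Throughput rate times the reciprocal of circuit area."""
    return throughput_rate / area


def toy_efficiency(m, n=16, t0=4.0, t1=1.0, a0=8.0, a1=2.0):
    """Efficiency of an m-bit-digit operator on n-bit words (toy model).

    Assumed, not measured:
      clock period = t0 + t1 * m   (longer carry path for wider digits)
      area         = a0 + a1 * m   (fixed overhead plus per-bit hardware)
    """
    if n % m:
        raise ValueError("digit size must divide the word size")
    clock_period = t0 + t1 * m
    words_per_unit_time = 1.0 / ((n // m) * clock_period)
    area = a0 + a1 * m
    return throughput_efficiency(words_per_unit_time, area)


digit_sizes = [1, 2, 4, 8, 16]            # 1 = bit-serial, 16 = fully parallel
best = max(digit_sizes, key=toy_efficiency)
```

Under these made-up parameters the measure peaks at an intermediate digit size rather than at either extreme; with different parameters the peak would move, which is the point of treating digit size as a design variable.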
The present inventors perceive that the use of arithmetics that use non-redundant plural-bit digits including multiple-bit as well as dual-bit digits greatly expands the range of design alternatives, lying between fully parallel and fully serial architectures, that are available to the integrated circuit designer. One can design systems, using a small digit size where high throughput is not so stringent a requirement and the space available on an integrated-circuit die for digit hardware is at a premium, and using a larger digit size where higher throughput rate is necessary. One can change digit size to adjust to the number of pins available for interconnection between integrated circuits or to solve routing problems for connections within an integrated circuit die.
The particular arithmetic favored by the inventors is a digit-serial arithmetic in which each word is a two's complement number of n bits, n being a positive integer that is a multiple of another positive integer m. The submultiple, m, is the number of bits in each digit of the word. The digits of a word are successively supplied to the data-flow architecture in order of their significance, least significant digit first and most significant digit last. The order of bits within each digit is prescribed according to the significance of the bits within that digit. The sign bit is the most significant bit of the word and is contained in the last digit of the word. The flow of digits is accompanied by another signal that indicates how the flow of digits may be partitioned into individual words.
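The digit-serial word format can be illustrated with a short sketch that splits an n-bit two's complement word into n/m digits of m bits each, least significant digit first, and reassembles it. The helper names `to_digits` and `from_digits` are hypothetical and introduced only for this example.

```python
def to_digits(value, n, m):
    """Split an n-bit two's complement word into n/m digits, LSD first."""
    if n % m:
        raise ValueError("n must be a multiple of m")
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError("value does not fit in n bits")
    pattern = value & ((1 << n) - 1)          # two's complement bit pattern
    mask = (1 << m) - 1
    return [(pattern >> (m * k)) & mask for k in range(n // m)]


def from_digits(digits, m):
    """Reassemble a word; the sign bit is the top bit of the last digit."""
    n = m * len(digits)
    pattern = sum(d << (m * k) for k, d in enumerate(digits))
    if pattern >> (n - 1):                    # sign bit set: negative word
        pattern -= 1 << n
    return pattern


digits = to_digits(-3, 8, 2)   # 1111 1101 as four 2-bit digits, LSD first
word = from_digits(digits, 2)
```

With m equal to 1 the same helpers produce a bit-serial stream, and with m equal to n they produce a single fully parallel word.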
While the indication may be furnished during the first digits of words, the inventors find it is preferable to furnish the indication during the last digits of words. Different digit-serial operations may be controlled during the first digits of words and during the last digits of words, respectively. It is usually more economical of hardware to derive the former indications from the latter indications by unit digit-interval delay than it is to derive the latter indications from the former indications by [(n/m)-1]-digit-interval delay. Bit-serial processing may be considered to be a special case of digit-serial processing, where digit size is one bit wide.
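The hardware argument about the two framing indications can be checked with a small register-chain model. The `delay` helper is hypothetical, and the assumption that the single register starts at ONE (as if a previous word had just ended) is made only so that the first word in the example is framed correctly.

```python
from collections import deque


def delay(signal, stages, initial=0):
    """Model a chain of `stages` clocked registers, one digit interval each."""
    regs = deque([initial] * stages)
    out = []
    for s in signal:
        regs.append(s)
        out.append(regs.popleft())
    return out


digits_per_word = 4                            # n/m, e.g. n = 16 and m = 4
last_flag = [0, 0, 0, 1, 0, 0, 0, 1]           # ONE during the last digit of each word

# First-digit indication from last-digit indication: a single register.
first_flag = delay(last_flag, 1, initial=1)

# The reverse derivation needs (n/m) - 1 registers.
last_again = delay(first_flag, digits_per_word - 1)
```

One register suffices in the first direction regardless of word length, whereas the register count in the reverse direction grows with the number of digits per word.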