The present invention relates to integrated circuits and in particular, to integrated circuits with circuitry for performing single precision floating point arithmetic.
Calculations involving single precision floating point arithmetic arise in many different applications. Often, these are computation intensive applications that benefit greatly from high performance calculations. The applications include, for example, video frame generation and digital signal processing (DSP) tasks.
Many different programs are used in frame generation. For example, see references [19]-[23]. These programs are both complex and need high performance. They are created with high level procedural and object-oriented computer programming languages such as C, C++, and FORTRAN. Only the most performance critical portions of these programs are usually directly written in assembly/machine language targeting the underlying rendering engine hardware because of the prohibitive expense and difficulty of programming in assembly/machine language. Floating point arithmetic is popular in these programs because of its wide dynamic range and programming ease.
The need for performance improvements is large. For example, optimal video editing requires a frame to be generated every second. Real-time virtual reality needs up to 30 frames generated per second. In order to satisfy these needs, and other similar needs in other applications, current technology must improve tremendously. For example, performance improvements needed to satisfy these two industrial applications are speedups of 108,000× for video editing (= 30 hrs./frame × 3600 seconds/hr) and 3,240,000× for virtual reality (= 30 × the video editing speedup).
A similar situation exists in high performance Digital Signal Processing. A typical DSP application includes processing images, often collected from 2-D and 3-D sensor arrays over time to construct images of the interior of materials including the human body and machine tools. These multidimensional signal processing applications construct images from banks of ultra-sound or magnetic imaging sensors. This has similar performance requirements to frame generation. These applications have the goal of resolving features in a reconstruction/simulation of a 3-D or 4-D environment. (Note: 4-D here means a 3-D domain observed/simulated over time.) Feature resolution is a function of input sensor resolution, depth of FFT analysis which can be computationally afforded within a given period of time, control of round-off errors and the accumulation of those rounding errors through the processing of the data frames.
Fine feature resolution in minimum time leads to performing millions and often billions of arithmetic operations per generated pixel or output data point. The use of floating point arithmetic to provide dynamic range control and flexible rounding error control is quite common. Algorithmic flexibility is a priority, due to the continuing software evolution and the availability of many different applications. These differing applications often require very different software.
The application software development requirements are very consistent. In particular, most applications need numerous programs, mostly written in the procedural computer programming languages, C, C++ and FORTRAN (see references [11]-[18]). Use of machine level programming is restricted to the most performance critical portions of the programs.
Typical algorithms for performing the applications discussed above have many common features, such as a need for large amounts of memory per processing element, often in the range of 100 MB; a need for very large numbers of arithmetic calculations per output value (pixel, data point, etc.); a need for very large numbers of calculations based upon most if not all input values (pixel, data point, etc.); and relatively little required communication overhead compared to computational capacity.
Many of these algorithms use calculations that involve, for example, several relatively short vectors and complex numbers. For example, some algorithms include calculating complex valued functions such as X=(az+b)/(cz+d), wherein a, b, c, d, and z are all complex floating point numbers. The algorithms define a0, b0, c0, d0, z0 and X0 as the real components and, correspondingly, a1, b1, c1, d1, z1 and X1 as the imaginary components. The calculation prior to entry into a floating point division circuit proceeds in two multiply-accumulate passes. In the first pass, the following are calculated:
A0=a0*z0−a1*z1+b0
A1=a0*z1+a1*z0+b1
B0=c0*z0−c1*z1+d0
B1=c0*z1+c1*z0+d1
In the second pass, the results of B0 and B1 are fed back into multiplier-accumulators (discussed later) as shared operands to generate:
C0=A0*B0+A1*B1
C1=A1*B0−A0*B1
D=B0*B0+B1*B1
Finally, the division operations are performed:
X0=C0/D
X1=C1/D
The circuitry disclosed herein optimizes the performance of calculation of the A, B, C and D formulae above.
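The two passes can be sketched in software as follows. This is a minimal illustration and not the disclosed circuitry: the function name `mobius_two_pass` and the use of Python's built-in `complex` type for input and output are assumptions made for the example. The second pass multiplies A by the complex conjugate of B, so that dividing both components by the real value D yields A/B.

```python
def mobius_two_pass(a, b, c, d, z):
    """Evaluate X = (a*z + b) / (c*z + d) with real multiply-accumulates.

    Hypothetical helper for illustration; inputs are Python complex numbers.
    """
    a0, a1 = a.real, a.imag
    b0, b1 = b.real, b.imag
    c0, c1 = c.real, c.imag
    d0, d1 = d.real, d.imag
    z0, z1 = z.real, z.imag

    # First pass: numerator A = a*z + b and denominator B = c*z + d.
    A0 = a0 * z0 - a1 * z1 + b0
    A1 = a0 * z1 + a1 * z0 + b1
    B0 = c0 * z0 - c1 * z1 + d0
    B1 = c0 * z1 + c1 * z0 + d1

    # Second pass: B0 and B1 are the shared operands.
    # C = A * conj(B) and D = |B|^2, so that A / B = C / D.
    C0 = A0 * B0 + A1 * B1
    C1 = A1 * B0 - A0 * B1
    D = B0 * B0 + B1 * B1

    # Final step: two real divisions.
    return complex(C0 / D, C1 / D)
```

Each of A0, A1, B0 and B1 is a sum of two products and one addend, and each of C0, C1 and D is a sum of two products in which B0 or B1 appears as one factor. This is the pattern that a multiplier-accumulator with a shared operand is suited to.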
Some of the major advances relevant to this invention relate to the development of high speed micro-processors and DSP engines. High speed micro-processors and DSP engines possess great intrinsic algorithmic flexibility and are therefore used in high performance dedicated frame rendering configurations such as the SUN network that generated Toy Story. See reference [1].
The advent of the Intel Pentium™ processors brought the incorporation of many of the performance tricks used in the RISC (Reduced Instruction Set Computing) community. “Appendix D: An Alternative to RISC: The INTEL 80×86” in reference [30] and “Appendix: A Superscalar 386” in reference [31] provide good references on this. “Appendix C: Survey of RISC Architectures” in reference [30] provides a good overview. However, commercial micro-processor and DSP systems are severely limited by their massive overhead circuitry. In modern super-scalar computers, this overhead circuitry may actually be larger than the arithmetic units. See references [30] and [31] for a discussion of architectural performance/cost tradeoffs.
High performance memory is necessary but not sufficient to guarantee fast frame generation because it does not generate the data; it simply stores it. It should be noted that there have been several special purpose components proposed which incorporate data processing elements tightly coupled on one integrated circuit with high performance memory, often DRAM. However, these efforts have all suffered limitations. The circuits discussed in [32] use fixed point arithmetic engines of very limited precision. The circuits discussed in [32] are performance constrained in floating point execution, and in the handling of programs larger than a single processor's local memory.
Currently available special purpose components are not optimized to perform several categories of algorithms. These components include
1. Image compression/decompression processors.
a. These circuits, while important, are very specialized and do not provide a general purpose solution to a variety of algorithms.
b. For example, such engines have tended to be very difficult to efficiently program in higher level procedural languages such as C, C++ and FORTRAN.
c. The requirement of programming them in assembly language implies that such units will not address the general purpose needs for multi-dimensional imaging and graphical frame generation without a large expenditure on software development. See references [24] and [25].
2. Processors optimized for graphics algorithms such as fractals, Z-buffers, Gouraud shading, etc.
a. These circuits do not permit optimizations for the wide cross-section of approaches that both graphics frame generation and image processing require.
b. See references [26]-[29].
3. Signal processing pre-processor accelerators such as wavelet and other filters, first pass radix-4, 8 or 16 FFTs, etc., and 1-D and 2-D Discrete Cosine Transform engines.
a. These circuits are difficult to program for efficient execution of the wide variety of large scale frame generation tasks.
4. Multiprocessor image processors.
a. These processors include mixed MIMD and SIMD systems that are ill-suited to general-purpose programming. See references [24] and [41]-[43].
b. These processors also include VLIW (Very Long Instruction Word) SIMD ICs such as Chromatic's MPACT ICs.
c. Such ICs again cannot provide the computational flexibility needed to program the large amount of 3-D animation software used in commercial applications, which require efficient compiler support. See references [34] and [39].
5. Multimedia signal processors.
a. These processors also have various limitations, such as lack of floating point support, lack of wide external data memory interface, access bandwidth to large external memories, deficient instruction processing flexibility and data processing versatility, and reliance on vector processors which are inefficient and difficult to program for operations without a very uniform data access mechanism concerning accumulating results. See, for example, references [35]-[38].
Following is a list of some of the central problems in achieving high performance multiplication and addition/subtraction. This list is not exhaustive, but it identifies some of the barriers that currently available devices have run up against.
Addition and subtraction typically involve the propagation of carry information from each digit's result to all higher digits. If there are two n bit operands, the time it takes to perform such operations is O(n). As operands get larger, more time is required by such circuitry.
By the time of the ILLIAC III (mid 1970's), local carry propagate addition algorithms had been discovered. Such algorithms have the advantage of determining a local carry bit from examination of only a few neighboring bits of the operands. The consequence of using one of these schemes is that additions and subtractions essentially take a constant amount of time, no matter how large the operands become. Several such algorithms have been discovered and are now considered standard tools by practitioners, embodied in circuitry such as carry save adders and redundant binary adders.
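As an illustration of the carry save idea mentioned above, the following sketch reduces three operands to a sum word and a carry word using only bitwise operations, so that no carry travels further than one bit position. The function names are hypothetical, and Python integers stand in for fixed-width hardware words.

```python
def carry_save_add(x, y, z):
    """Reduce three non-negative integers to a (sum, carry) pair.

    Every output bit depends only on the three input bits in the same
    position, so the delay does not grow with the operand width.
    """
    partial_sum = x ^ y ^ z                      # per-bit sum, carries ignored
    carry = ((x & y) | (x & z) | (y & z)) << 1   # per-bit majority, one position up
    return partial_sum, carry


def sum_operands(operands):
    """Accumulate many operands in carry save form, then resolve once."""
    s, c = 0, 0
    for value in operands:
        s, c = carry_save_add(s, c, value)
    return s + c   # the only carry-propagating addition
```

In this sketch, the ordinary addition in the last line of `sum_operands` is the only step whose delay depends on the operand width. It is performed once, however many operands are accumulated.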
There are several known multiplication algorithms of varying levels of utility. A popular algorithm is known as the Booth multiplication algorithm. The basic idea of Booth's algorithm is to skip over individual iterations in an iterative shift-and-add implementation of the multiplication. The algorithm skips over 0 bits in the multiplier, and it skips over sequences of 1 bits. The idea is that a sequence of N 1's in the multiplier is numerically equal to 2^N − 1, so the effect of multiplying by this group of 1's is the same as a subtraction in the least significant position, followed by an addition N positions to the left.
In a Booth algorithm, the B multiplicand may be decomposed into overlapping collections of k+1 bits of successively greater order B[0:k], B[k:2*k], . . . For example, in a 4-3 Booth Multiplication Algorithm k is equal to 3. More details of Booth Algorithms will be well understood by one of skill in the art.
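The overlapping decomposition can be sketched as follows. This is a textbook-style radix-2^k Booth recoding written for illustration, not a description of the disclosed circuit: the function names, the default 12-bit width, and the two's complement interpretation of the recoded operand are assumptions. Each window of k+1 bits shares one bit with its neighbor and is turned into a signed digit.

```python
def booth_digits(value, n_bits, k):
    """Recode an n_bits two's complement value into radix-2**k signed digits.

    Window i covers bits i*k-1 .. i*k+k-1 (k+1 bits, overlapping by one),
    with an implicit 0 below bit 0.
    """
    assert n_bits % k == 0
    bits = value & ((1 << n_bits) - 1)   # two's complement bit pattern
    window = bits << 1                   # append the implicit bit below bit 0
    digits = []
    for i in range(0, n_bits, k):
        group = (window >> i) & ((1 << (k + 1)) - 1)
        top = group >> k                              # highest bit, negative weight
        low = (group >> 1) & ((1 << (k - 1)) - 1)     # middle bits, positive weight
        prev = group & 1                              # bit shared with previous window
        digits.append(low + prev - (top << (k - 1)))
    return digits


def booth_multiply(a, b, n_bits=12, k=3):
    """Multiply a by b, where b fits in n_bits as a two's complement value."""
    total = 0
    for i, digit in enumerate(booth_digits(b, n_bits, k)):
        total += (digit * a) << (i * k)   # one partial product per window
    return total
```

With k equal to 1 this reduces to the classic form: `booth_digits(7, 4, 1)` gives the digits -1, 0, 0, 1 (least significant first), which is the subtraction in the lowest position followed by an addition three positions to the left for a run of three 1's. With k equal to 3 each digit lies between -4 and 4, so a 12-bit operand yields four partial products.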
A wire is a mechanism for sharing the state between multiple nodes of a circuit. The state is a finite alphabet based upon certain physical conditions including but not limited to: electrical voltage, current, phase, spectral decomposition, and photonic amplitude. Symbols are the individual elements of an alphabet. Measured value ranges of the relevant physical conditions typically encode symbols. The most commonly used alphabet is the set {0,1}, the binary symbol set. Binary systems using all of the above schemes exist. Other commonly used alphabets include 3 symbol alphabets, e.g., {0,1,2} also alternatively denoted as {−1,0,1}, multiple binary alphabets, e.g., {00,01,10,11}, etc. There are other alphabets in use. A wire may be embodied, for example, as a strip of metal (e.g., in an integrated circuit or on a circuit board), an optical fiber, or a microwave channel (sometimes referred to as a microchannel).
A wire bundle is a collection of one or more wires.
A bus is a wire bundle possessing a bus protocol. The bus protocol defines communication between circuits connected by the wire bundle. A bus will typically be composed of component wire bundles wherein one or more of the component wire bundles will determine which connected components are receiving and which are transmitting upon one or more of the other component wire bundles.
Floating point notation includes a collection of states representing a numeric entity. The collection includes sub-collections of states defining the sign, mantissa and exponent of the represented number. Such notations non-exclusively include binary systems including but not limited to IEEE standard floating point and special purpose floating point notations including but not limited to those discussed in the references of this specification. A floating point notation non-exclusively includes extensions whereby there are two sub-collections each containing the sign, mantissa and exponent of a number as above. The numeric representation is of an interval wherein the number resides. A floating point notation additionally includes non-binary systems, wherein the mantissa and exponent refer to powers of a number other than 2.
A programmable finite state machine is a machine which includes a state register, possibly one or more additional registers wherein state conditions, quantities, etc. reside, and a mechanism by which the state register, additional registers, and external inputs generate the next value for the state register and possible additional registers.
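The definition above can be illustrated with a short sketch in which a transition function maps the current state, the additional registers, and one external input to the next state and register values. The names `run_fsm` and `count_pulses`, the state labels, and the pulse counting behavior are all hypothetical and chosen only to make the example concrete.

```python
def run_fsm(transition, state, registers, inputs):
    """Step a state machine once per external input and return the final values."""
    for external in inputs:
        state, registers = transition(state, registers, external)
    return state, registers


def count_pulses(state, registers, external):
    """Hypothetical transition function: count rising edges on a 1-bit input."""
    count = registers["count"]
    if state == "LOW" and external == 1:
        return "HIGH", {"count": count + 1}   # rising edge seen
    if state == "HIGH" and external == 0:
        return "LOW", {"count": count}        # input returned to 0
    return state, {"count": count}            # no change
```

Here the string held in `state` plays the role of the state register and the `count` entry plays the role of an additional register. Supplying a different transition function changes the behavior of the machine without changing `run_fsm`.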
A Single Instruction Multiple Datapath (SIMD) architecture executes the same instruction on more than one datapath during the same instruction execution cycle. A typical extension to this basic concept is the incorporation of “status flag bits” associated with each datapath. These flags enable and disable the specific datapath from executing some or all of the globally shared instructions.
SIMD architectures are optimal in situations requiring identical processing of multiple data streams. The inherent synchronization of these data streams produces advantages by frequently simplifying communications control problems. SIMD architectures generally become inefficient whenever the data processing becomes dissimilar between the datapaths.
SIMD architectures require a relatively small amount of instruction processing overhead cost because there is only one instruction processing mechanism, shared by the datapaths. The instruction processing mechanism has an instruction fetching mechanism. The datapath collection typically needs only one instruction memory.
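A minimal software model of SIMD execution with per-datapath status flags might look like the following. The function `simd_step` and its argument layout are assumptions made for illustration. A single shared instruction is applied to every datapath whose flag is set, and the remaining datapaths keep their previous register values.

```python
def simd_step(op, registers, operands, enabled):
    """Apply one shared instruction `op` to every enabled datapath.

    registers, operands and enabled are parallel lists, one entry per datapath.
    """
    return [
        op(reg, operand) if flag else reg
        for reg, operand, flag in zip(registers, operands, enabled)
    ]
```

For example, `simd_step(lambda r, x: r + x, [1, 2, 3, 4], [10, 10, 10, 10], [True, False, True, True])` adds 10 in three datapaths and leaves the second one untouched.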
A Multiple Instruction Multiple Datapath (MIMD) architecture executes distinct instructions on different datapath units. The fundamental advantage of this approach is flexibility. Any data processing unit can execute its own instruction, independently of the other data processing units. However, this flexibility has added costs. In particular, each data processing unit must possess its own instruction fetching, decoding, and sequencing mechanisms. The instruction fetching mechanism frequently possesses at least a small memory local to the data processor. This local memory is often a cache.
A (Very) Long Instruction Word Processor (VLIW and LIW, respectively) is a class of architectures whereby a single instruction processing mechanism contains a program counter capable of common operations such as branches on condition codes, and multiple instruction fields that independently control datapath units. In these architectures, the datapath units often are not identical in structure or function.
An Array Processor is defined as an LIW or VLIW instruction-processing architecture having multiple datapath units arranged into one or more collections. In embodiments of the present invention, as will be described, the datapath collections receive and may act upon a common operand received via a common operand bus having a transmitting unit; each datapath receives one or more additional operands from memory; each datapath collection contains an instruction memory possessing instruction fields that control the operation of the program controlled elements within; each datapath unit contains one or more multiplier/accumulators (MACs); and each MAC possesses a multiplicity of accumulating registers.
A Multiplier-Accumulator is an arithmetic circuit simultaneously performing multiplication of two operands and addition (and possibly subtraction) of at least one other operand.
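The interplay of a common operand, per-datapath operands, and multiple accumulating registers can be sketched as follows. The class `Mac`, its default of four accumulators, and the matrix-vector example are assumptions made for illustration and do not describe the actual circuit.

```python
class Mac:
    """Toy multiplier-accumulator with several accumulating registers."""

    def __init__(self, n_accumulators=4):
        self.acc = [0.0] * n_accumulators

    def step(self, shared, local, index, subtract=False):
        """Multiply the shared operand by a local one and accumulate the product."""
        product = shared * local
        self.acc[index] += -product if subtract else product
        return self.acc[index]


def matvec(rows, x):
    """Compute a matrix-vector product with one MAC per matrix row."""
    macs = [Mac() for _ in rows]
    for j, shared in enumerate(x):         # one common operand per cycle
        for mac, row in zip(macs, rows):   # every datapath sees the same operand
            mac.step(shared, row[j], index=0)
    return [mac.acc[0] for mac in macs]
```

In `matvec`, each element of `x` is delivered to all datapaths at once, while each datapath fetches its own matrix element. This corresponds to the role of the common operand bus and the per-datapath memory operands described above.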
The fast Fourier transform is a highly optimized algorithm for generating the spectrum of a signal. See the relevant chapters of references [11], [12], and [15] for thorough discussions of the various involved topics.
A vector processor is an architecture designed to operate exclusively on vectors. Typically, a vector processor is deeply pipelined. There is a large literature devoted to vector processing. See references [46]-[53]. Appendix B, xe2x80x9cVector Processorsxe2x80x9d in reference [30] provides an overview on the subject.
The present invention provides an arithmetic engine for multiplication and addition/subtraction that avoids the above-described limitations with regard to computation for video frame rendering and DSP tasks. This invention addresses such performance requirements in a manner more efficient than prior circuitry both in terms of silicon utilization for the performance achieved and in terms of ease of coding.
According to an aspect of the present invention, a digital logic circuit which performs multiple floating point operations concurrently is provided. The logic circuit comprises a shared operand generator that receives a first operand and outputs a result that is a fixed function of the first operand, and an arithmetic circuit. The arithmetic circuit includes a plurality of multiply circuits, each of the plurality of multiply circuits having circuitry to calculate partial products from multiplying a second operand with the first operand and with the results of the shared operand generator. It also includes circuitry to selectively calculate a sum of the partial products and a third operand and produce an arithmetic result.
According to another aspect of the present invention, a device for performing multiply/accumulate operations based upon a multiplication algorithm utilizing successive small bit multiply operations is provided. The device includes a plurality of small bit multipliers, wherein each of the plurality of small bit multipliers operates to perform the multiplication algorithm on a first input and one of the bits of a second input to calculate a plurality of partial products. It also includes an adder tree for adding the partial products to calculate a multiply/accumulate result.
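The arrangement of small multipliers feeding an adder tree can be modeled roughly as below. This is a sketch under stated assumptions: the second input is treated as an unsigned value split into hypothetical 3-bit slices, each slice produces one partial product, and the tree adds values pairwise. The function names are illustrative only.

```python
def partial_products(a, b, n_bits=12, k=3):
    """One partial product per k-bit slice of the unsigned n_bits value b."""
    mask = (1 << k) - 1
    return [(a * ((b >> i) & mask)) << i for i in range(0, n_bits, k)]


def adder_tree(values):
    """Sum a non-empty list by pairwise addition, level by level."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])   # odd element passes to the next level
        values = paired
    return values[0]


def multiply_accumulate(a, b, c, n_bits=12, k=3):
    """Compute a*b + c by feeding the addend into the same tree as the products."""
    return adder_tree(partial_products(a, b, n_bits, k) + [c])
```

Feeding the addend `c` into the tree alongside the partial products is what makes this a multiply/accumulate and not a multiply followed by a separate addition.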
In yet another aspect of the present invention, an arithmetic circuit for performing floating point operations is provided. Two strips of consecutive logic cells are provided to operate on the mantissas of two floating point operands. A comparator is provided to compare the exponents of the two operands. If the exponent of the first operand is larger than the exponent of the second operand, then the two strips of consecutive logic cells are arranged with the second strip being consecutively after the first strip. A carry signal from the most significant logic cell of the first strip is coupled to the least significant logic cell of the second strip. If the exponent of the second operand is larger than the exponent of the first operand, then the two strips are arranged in the opposite order consecutively so that the carry bit of the most significant logic cell of the second strip may be coupled to the least significant logic cell of the first strip.
A comparator determines the difference in the exponents and shifts the second operand with respect to the first operand accordingly. By this arrangement, the first operand is coupled to the first strip and the second operand is shifted between the first strip and the second strip. The overlapping bits are operated upon by the logic cells, while the non-overlapping bits are passed along without change by putting a zero value as the second operand.
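The effect of comparing exponents and shifting one mantissa relative to the other can be illustrated with integer mantissas. The function `aligned_add` is a hypothetical software analogue: it models only the alignment step, using a result wide enough to hold both operands, and does not model the two strips of logic cells, rounding, or normalization.

```python
def aligned_add(m1, e1, m2, e2):
    """Add m1 * 2**e1 and m2 * 2**e2 exactly; return (mantissa, exponent).

    The operand with the larger exponent is shifted left by the exponent
    difference so that both mantissas share the smaller exponent.
    """
    if e1 >= e2:
        high, low, shift, exponent = m1, m2, e1 - e2, e2
    else:
        high, low, shift, exponent = m2, m1, e2 - e1, e1
    # Bits of `high` above the width of `low` do not overlap with it and
    # change only if a carry reaches them.
    return (high << shift) + low, exponent
```

For instance, `aligned_add(5, 3, 3, 1)` returns `(23, 1)`, that is 23 × 2, which equals 5 × 8 + 3 × 2. Swapping the two operands gives the same result, mirroring the way the arrangement described above depends on which exponent is larger.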