The present invention relates to circuits. More particularly, the present invention relates to high-performance integrated circuitry suited to the video frame generation task and digital signal processing (DSP) tasks.
The central problem of the graphics industry is to generate frames. Each frame is a rectangular array of pixels, often containing more than 1 million pixels. Animation, and in particular 3-D animation, usually requires many millions to many billions of calculations to generate one pixel. Similarly, graphics applications (such as medical imaging) require the generation of one or more frames, again, often in animation sequences.
Consider, as an example of this need, the 1995 effort by Pixar to generate the motion picture Toy Story. There are 110,000 frames in the motion picture. Pixar used 87 dual processor 100 MHz Sparc 20's and 30 quad processor 100 MHz Sparc 20's. This is a total of 294 CPU's. There was an average of 96 Megabytes of RAM per CPU and each processing node had a 3-5 gigabyte local disk drive. The disk farms and their servers are not relevant to this patent, but were quite large. The entire motion picture took 46 days to compute, with an average frame taking between 1 and 3 hours of Sparc CPU processor time. See reference [1].
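The figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not taken from reference [1]: the function names are hypothetical, and the wall-clock estimate assumes perfect load balancing across all CPUs with no I/O or scheduling overhead.

```python
# Figures quoted in the text (reference [1]); helper names are illustrative only.
DUAL_NODES = 87      # dual-processor Sparc 20 machines
QUAD_NODES = 30      # quad-processor Sparc 20 machines
FRAMES = 110_000     # frames in the motion picture


def total_cpus(dual_nodes: int, quad_nodes: int) -> int:
    """Count CPUs across dual- and quad-processor nodes."""
    return 2 * dual_nodes + 4 * quad_nodes


def wall_clock_days(cpu_hours_per_frame: float, frames: int, cpus: int) -> float:
    """Idealized elapsed time in days.

    Assumption (not from the source): every CPU is busy all the time and
    the work divides evenly, so elapsed time = total CPU-hours / CPU count.
    """
    return frames * cpu_hours_per_frame / cpus / 24.0


if __name__ == "__main__":
    cpus = total_cpus(DUAL_NODES, QUAD_NODES)
    print(f"CPUs: {cpus}")
    for hours in (1.0, 3.0):
        days = wall_clock_days(hours, FRAMES, cpus)
        print(f"{hours:.0f} CPU-hour(s) per frame: about {days:.1f} days")
```

Under that idealized assumption, the 3-hour-per-frame figure corresponds to roughly 46.8 days, which is consistent with the 46 days quoted above.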
While Toy Story was not photo-realistic, it did represent a breakthrough. It was the first full-length feature motion picture entirely generated by 3D computer animation technology. Photo-realism for such a film requires at least a factor of 10 more computing complexity. Assume, accordingly, that a photo-realistic frame takes 30 hours to compute, that is, ten times the 3-hour upper figure given above.
There are many different programs used in frame generation. See references [19]-[23]. These programs are both complex and need high performance. They are created with high level procedural and object-oriented computer programming languages such as C, C++, and FORTRAN. Only the most performance critical portions of these programs might be directly written in assembly/machine language targeting the underlying rendering engine hardware because of the prohibitive expense and difficulty of programming in assembly/machine language. Floating point arithmetic is popular in these programs because of its wide dynamic range and programming ease.
The need for performance improvements is large. Optimal video editing requires 1 frame every second. Real-time virtual reality needs up to 30 frames generated per second. Starting from 30 hours per frame, the performance improvements needed to satisfy these two industrial applications are speedups of 108,000× for video editing (= 30 hrs./frame × 3600 seconds/hr.) and 3,240,000× for virtual reality (= 30 × the video editing speedup).
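The speedup figures follow directly from the assumed 30 hours per frame and the target frame rates. A minimal sketch of that arithmetic is shown here; the function name `required_speedup` is hypothetical and is used only for illustration.

```python
SECONDS_PER_HOUR = 3600


def required_speedup(hours_per_frame: int, target_frames_per_second: int) -> int:
    """Speedup needed to go from `hours_per_frame` of compute per frame
    to `target_frames_per_second` frames generated every second."""
    seconds_per_frame_now = hours_per_frame * SECONDS_PER_HOUR
    return seconds_per_frame_now * target_frames_per_second


if __name__ == "__main__":
    video_editing = required_speedup(30, 1)      # 1 frame per second
    virtual_reality = required_speedup(30, 30)   # 30 frames per second
    print(f"video editing:   {video_editing:,}x")
    print(f"virtual reality: {virtual_reality:,}x")
```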
A similar situation exists in high performance Digital Signal Processing. The typical requirement includes processing images, often collected from 2-D and 3-D sensor arrays over time to construct images of the interior of materials including the human body and machine tools.
These multidimensional signal processing applications construct images from banks of ultrasound or magnetic imaging sensors. They have performance requirements similar to those of frame generation. These applications have the goal of resolving features in a reconstruction/simulation of a 3-D or 4-D environment. (Note: 4-D here means a 3-D domain observed/simulated over time.) Feature resolution is a function of input sensor resolution, the depth of FFT analysis that can be computationally afforded within a given period of time, and the control of round-off errors and of the accumulation of those rounding errors through the processing of the data frames.
Fine feature resolution in minimum time leads to performing millions and often billions of arithmetic operations per generated pixel or output data point. The use of floating point arithmetic to provide dynamic range control and flexible rounding error control is quite common. Algorithmic flexibility is a priority, due to the continuing software evolution and the availability of many different applications. These differing applications often require very different software.
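The accumulation of rounding errors mentioned above can be seen even in a very small case. The following generic sketch is not drawn from the cited references; it compares a plain running sum with Python's standard `math.fsum`, which tracks the lost low-order bits. The function name `naive_sum` is hypothetical.

```python
import math


def naive_sum(values):
    """Accumulate in a single double-precision running total.
    Each addition rounds, and the rounding errors can build up."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    samples = [0.1] * 10  # 0.1 has no exact binary floating point form
    print("naive running sum:", repr(naive_sum(samples)))
    print("math.fsum:        ", repr(math.fsum(samples)))
```

With millions or billions of operations per output point, the choice of accumulation strategy is one of the rounding error controls that such applications have to weigh against computing time.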
The application software development requirements are very consistent. In particular, most applications need numerous programs, mostly written in the procedural computer programming languages, C, C++ and FORTRAN (see references [11]-[18]); and the use of machine level programming is restricted to the most performance critical portions of the programs.
The target algorithms display the following common features: a need for large amounts of memory per processing element, often in the range of 100 MB; a need for very large numbers of arithmetic calculations per output value (pixel, data point, etc.); a need for very large numbers of calculations based upon most if not all input values (pixel, data point, etc.); and relatively little required communication overhead compared to computational capacity.
Support for high resolution graphics has developed over the last 30 years. Preliminary efforts in the 1960's and early 1970's, such as those seen in reference [40], created graphics computer systems with a minimum of specialized hardware. There was little or no thought at that time to VLSI (Very Large Scale Integration) integrated circuits (ICs).
The support of the graphics industry by semiconductor devices has focused on the following issues:
A. Support of I/O devices, with the support of the screen display device consuming the bulk of the effort. This has led to the development of specialized Integrated Circuits to control the screen. See reference [2].
B. Development of high speed micro-processors and Digital Signal Processors.
C. Development of high speed and high density memory devices, particularly DRAMs, VRAMs, etc.
D. Special purpose components aimed at real-time image processing and frame generation applications.
These efforts have fundamental limitations, as described below.
A. Display device controllers are limited in that each frame is generated by a fixed execution structure machine within a specific amount of time. Thus, the variation of frame algorithms is necessarily limited.
B. High speed micro-processors and DSPs possess great intrinsic algorithmic flexibility and are therefore used in high performance dedicated frame rendering configurations such as the SUN network that generated Toy Story. See reference [1]. The advent of the Intel Pentium™ processors brought the incorporation of all the performance tricks of the RISC (Reduced Instruction Set Computing) community. “Appendix D: An Alternative to RISC: The INTEL 80x86” in reference [30] and “Appendix: A Superscalar 386” in reference [31] provide good references on this. “Appendix C: Survey of RISC Architectures” in reference [30] provides a good overview.
However, commercial micro-processor and DSP systems are severely limited by their massive overhead circuitry. In modern super-scalar computers, this overhead circuitry may actually be larger than the arithmetic units. See references [30] and [31] for a discussion of architectural performance/cost tradeoffs.
C. High performance memory is necessary but not sufficient to guarantee fast frame generation because it does not generate the data—it simply stores it.
D. There have been several special purpose components proposed which incorporate data processing elements tightly coupled on one integrated circuit with high performance memory, often DRAM. However, these efforts have all suffered limitations. The circuits discussed in [32] use fixed point arithmetic engines of very limited precision. The circuits discussed in [32] are performance constrained in floating point execution, and in the handling of programs larger than a single processor's local memory.
The proposed special purpose components are optimized to perform several categories of algorithms. These components include:
D1. Image compression/decompression processors. These circuits, while important, are very specialized and do not provide a general purpose solution to a variety of algorithms. For example, such engines have tended to be very difficult to program efficiently in higher level procedural languages such as C, C++ and FORTRAN. The requirement of programming them in assembly language implies that such units will not address the general purpose needs for multi-dimensional imaging and graphical frame generation without a large expenditure on software development. See references [24] and [25].
D2. Processors optimized for graphics algorithms such as fractals, Z-buffers, Gouraud shading, etc. These circuits do not permit optimizations for the wide cross-section of approaches that both graphics frame generation and image processing require. See references [26]-[29].
D3. Signal processing pre-processor accelerators such as wavelet and other filters, first pass radix-4, 8 or 16 FFT's, and 1-D and 2-D Discrete Cosine Transform engines. These circuits are difficult to program for efficient execution of the wide variety of large scale frame generation tasks.
D4. Multiprocessor image processors. These processors include mixed MIMD and SIMD systems that are ill-suited to general-purpose programming. See references [24] and [41]-[43]. These processors also include VLIW (Very Long Instruction Word) SIMD ICs such as Chromatic's MPACT ICs. Such ICs again cannot provide the computational flexibility needed to program the large amount of 3-D animation software used in commercial applications, which requires efficient compiler support. See references [34] and [39].
D5. Multimedia signal processors. These processors also have various limitations, such as lack of floating point support, lack of wide external data memory interface access bandwidth to large external memories, deficient instruction processing flexibility and data processing versatility, and reliance on vector processors, which are inefficient and difficult to program for operations that lack a very uniform data access mechanism for accumulating results. See references [35]-[38].
What is needed is a computational engine that avoids the above-described limitations with regard to computation for video frame rendering and DSP tasks.